Technical Objectives and Issues

The goals of the StarLite project are twofold. First, we investigate new technologies associated with the
construction of real-time, distributed operating systems, database systems, and communication networks.
Second, the research will produce tools and experimental systems using the StarLite software prototyping
environment. The issues being addressed are priority-based scheduling, methodology for developing
prototyping tools, system integration, operating system support and interface issues, and fault tolerance in
real-time computing.

Approach 

Our research is directed towards discovering a set of design principles and developing efficient
algorithms for distributed real-time operating systems and database systems. In addition to theoretical
study, we are also developing experimental systems and prototyping tools for performance evaluation of
new technology.

Accomplishments  and  Significance 

One of the most important achievements in FY91 is the development of new scheduling algorithms
based on the idea of adjusting the serialization order of active transactions dynamically. This is the first
successful attempt to integrate the benefits of pessimistic and optimistic approaches to transaction
scheduling. When compared with conventional transaction scheduling algorithms (e.g., two-phase locking), our
algorithms significantly improve the percentage of high priority transactions that meet the deadline. We
have also developed priority-ordered deadlock avoidance algorithms for real-time resource management,
and replication control algorithms for distributed real-time databases. These algorithms are expected to be
efficient for distributed real-time systems, in which replicated resources should be managed to support
consistency while satisfying timing constraints. Using the StarLite prototyping environment, we have
implemented those algorithms and demonstrated that they provide a higher level of concurrency and greater
flexibility in meeting timing requirements.

We have developed a new paradigm for multiprocessor real-time systems, and implemented a parallel
programming interface based on our paradigm. Our new paradigm has created new research opportunities
for operating systems and databases for parallel computing systems with timing constraints. For
example, using the new programming interface, we have developed PRDB, an experimental real-time
database system that runs on an emulated tightly-coupled, shared-memory multiprocessor system in the
StarLite environment. It provides a general paradigm for exploiting parallelism and different real-time
scheduling policies. This experimental system has been used for investigating implementation techniques
for parallel database systems and the impact of multiprocessor technology on operating systems design.

We have developed a suite of database systems on several platforms, such as StarLite, ARTS, and
UNIX, and utilized them as system integration testbeds. Since a real-time system must operate in the
context of operating system services, the correct functioning and timing behavior of the system depend heavily
on the operating system interfaces. We have developed a multi-thread database server, called RTDB, for
the ARTS real-time operating system kernel. RTDB now supports an application programming interface
and a graphical user interface. Our experimental systems achieve another goal of this project: to transfer
technology developed under the StarLite project to Navy, DoD, and other research organizations. Currently,
the Naval Ocean Systems Center in San Diego, California, is using RTDB for its distributed real-time
experiments.

Plans  for  FY92 

We will continue to expand the module hierarchy of the StarLite environment by including modules
for optimistic scheduling, system recovery, and a graphical user interface. We will also investigate
system support requirements for real-time applications, and then evaluate StarLite from that perspective.
In addition, we will refine the current single processor and multiprocessor implementations. Furthermore,
we will begin work on a distributed version of the StarLite operating system that will be integrated with our
distributed database kernel.

We will investigate scheduling and concurrency control algorithms, and perform experiments for
their evaluation. We will also implement replication control algorithms and recovery algorithms in
StarLite for their performance evaluation. With respect to experimental database systems, we will add
real-time transaction features and evaluate them for realistic applications. Our findings and the new
technology developed in this project will continue to be transferred to other Navy organizations.

Presentations,  Publications,  and  Honors 

•  Presentations 

•  Cook,  The  StarLite  Project,  University  of  North  Carolina  at  Charlotte. 

•  Cook,  The  StarLite  Project,  University  of  Hawaii. 

•  Cook,  The  StarLite  Project,  Microsoft  Corporation. 

•  Son,  Real-Time  Database  Systems,  NOSC  Code  413  DC^  Quarterly  Review. 

•  Son,  Real-Time  Database  Systems,  Sogang  University. 

•  Son,  Prototyping  Approach  to  Database  Research,  Electronics  and  Telecommunications  Research 
Institute. 

•  Son,  Scheduling  Real-Time  Transactions,  Seoul  National  University. 

•  Book  Chapters 

(1)  Cook, R. P., "The StarLite Operating System," Operating Systems for Mission-Critical Computing,
K. Gordon, P. Hwang, and A. Agrawala (Editors), ACM Press, 1991.

(2)  Cook,  R.  P.,  "Modula-2,"  Encyclopedia  of  Computer  Science,  (to  appear). 

(3)  R. Cook, L. Hsu, and S. H. Son, “Real-Time, Priority-Ordered, Deadlock Avoidance Algorithms,”
in Foundations of Real-Time Computing: Scheduling and Resource Management, A. Van Tilborg
and G. M. Koob (Editors), Kluwer Academic Publishers, 1991.

(4)  S. H. Son, “On Priority-based Synchronization Protocols for Distributed Real-Time Database
Systems,” in Distributed Databases in Real-Time Control, E. Knuth and M. Rodd (Editors),
Pergamon Press, 1990.

(5)  S. H. Son, Y. Lin, and R. Cook, “Concurrency Control in Real-Time Database Systems,” in
Foundations of Real-Time Computing: Scheduling and Resource Management, A. Van Tilborg and G.
M. Koob (Editors), Kluwer Academic Publishers, 1991.

(6)  S. H. Son, R. Cook, J. Lee, and H. Oh, “New Paradigms for Real-Time Database Systems,” in
Real-Time Programming, K. Ramamritham and W. Halang (Editors), Pergamon Press, (to appear).

(7)  S.  H.  Son  and  S.  Park,  “Scheduling  Transactions  for  Distributed  Time-Critical  Applications,”  in 
Advances  in  Distributed  Systems,  T.  Casavant  and  M.  Singhal  (Editors),  IEEE  Computer  Society, 
(to  appear). 

•  Refereed  Publications 

(1)  S. H. Son, “Reconstruction of Distributed Databases,” Computer Systems Science and
Engineering, vol. 5, no. 4, October 1990.

(2)  P. Shebalin, S. H. Son, and C. Chang, “An Approach to Software Safety Analysis in Distributed
Systems,” Computer Systems Science and Engineering, vol. 6, no. 2, April 1991.

(3)  L. Sha, R. Rajkumar, S. H. Son, and C. Chang, “A Real-Time Locking Protocol,” IEEE
Transactions on Computers, vol. 40, no. 7, July 1991.

(4)  S. H. Son, “An Environment for Integrated Development and Evaluation of Real-Time
Distributed Database Systems,” Journal of Systems Integration, (to appear).

(5)  S.  H.  Son,  J.  Rainer,  S.  Chiang,  “StarBase:  A  Simulation  Laboratory  for  Distributed  Database 
Research,”  Journal  of  Computer  Simulation,  (to  appear). 

(6)  Cook,  R.  P.  and  H.  Oh,  "The  StarLite  Project,"  Frontiers  90  Conference  on  Massively  Parallel 
Computation,  October  1990. 

(7)  Y. Lin and S. H. Son, “Concurrency Control in Real-Time Databases by Dynamic Adjustment of
Serialization Order,” 11th IEEE Real-Time Systems Symposium, Orlando, Florida, December
1990.

(8)  S. H. Son, C. Iannacone, and R. Beckinger, “Integrating Databases with Real-Time Computing
Systems,” IEEE Southeastcon '91, Williamsburg, Virginia, April 1991.

(9)  S. H. Son, M. Poris, and C. Iannacone, “Implementing a Distributed Real-Time Database
Manager,” The Second International Symposium on Database Systems for Advanced Applications
(DASFAA '91), Tokyo, Japan, April 1991.

(10)  H. Kang and S. H. Son, “A Hierarchical Export/Import Scheme for Data Sharing in a Federated
Distributed Database System,” The Second International Symposium on Database Systems for
Advanced Applications (DASFAA '91), Tokyo, Japan, April 1991.

(11)  S. H. Son, P. Wagle, and S. Park, “Real-Time Database Scheduling: Design, Implementation, and
Performance Evaluation,” The Second International Symposium on Database Systems for
Advanced Applications (DASFAA '91), Tokyo, Japan, April 1991.

(12)  R.  P.  Cook,  S.  H.  Son,  H.  Y.  Oh,  and  J.  Lee,  “New  Paradigms  for  Real-Time  Database  Systems,” 
8th  IEEE  Workshop  on  Real-Time  Operating  Systems  and  Software,  Atlanta,  Georgia,  May  1991. 

(13)  S. H. Son, C. Iannacone, and M. Poris, “RTDB: A Real-Time Database Manager for Time-Critical
Applications,” Euromicro Workshop on Real-Time Systems, Paris, France, June 1991.

(14)  S. H. Son and S. Chiang, “Concurrent Checkpointing Algorithms for Distributed Database
Systems,” Fourth International Conference on Parallel and Distributed Computing and Systems,
Washington, DC, October 1991 (to appear).

(15)  S. H. Son, S. Park, and Y. Lin, “An Integrated Real-Time Locking Protocol,” Eighth IEEE
International Conference on Data Engineering, Phoenix, Arizona, February 1992 (to appear).

•  Honors  and  Recognition 

•  Cook, Program Committee, Eighth IEEE Workshop on Real-Time Software and Operating Systems,
Atlanta, Georgia (1991).

•  Son,  ACM  Distinguished  Lecturer,  1991  —  1993. 

•  Son,  Chair,  Technical  Activities  Committee,  Korean  Computer  Scientists  and  Engineers  Association. 

•  Son,  Program  Committee,  IEEE  Symposium  on  Reliable  Distributed  Systems,  Huntsville,  Alabama 
(1990). 

•  Son,  Program  Committee,  ACM  SIGMOD  International  Conference  on  Management  of  Data,  Denver, 
Colorado  (1991). 

•  Son,  Program  Committee,  IEEE  Workshop  on  Real-Time  Operating  Systems  and  Software,  (1992). 

•  Son,  Program  Committee,  International  Workshop  on  Transaction  and  Query  Processing,  (1992). 

•  Son,  Program  Committee,  International  Symposium  on  Database  Systems  for  Advanced  Applications, 
(1993). 

•  Son,  Panelist,  International  Conference  on  Very  Large  Data  Bases  (VLDB  ’91),  on  the  panel  "Real- 
Time  Databases,"  (1991). 

•  Son,  Session  Chair,  Ninth  IEEE  Symposium  on  Reliable  Distributed  Systems,  (1990). 

•  Son, Session Chair, 12th IEEE Real-Time Systems Symposium, (1991).

•  Son,  Invited  Paper,  “A  Prototyping  Approach  to  Distributed  Database  Research,”  Database  Review, 
vol.  6,  October  1990. 

•  Son, Invited Paper, “Real-Time Database Systems: A New Challenge,” Data Engineering, Special
Issue on Directions for Future Database Research and Development, vol. 13, no. 4, December 1990.


•  Son, Invited Paper, “Techniques for Database Recovery in Distributed Environments,” Management
and Organization of Automation, Kluwer Bedrijfswetenschappen, January 1991.

•  Students  and  Post-Docs 

Seog  Park,  (Post-Doc),  real-time  transaction  scheduling 

Marc  Poris  (Research  Associate),  database  integration  with  real-time  kernel 

Shi-Chin  Chiang  (Ph.D.  student),  checkpointing  in  distributed  systems 

Lee Hsu (Ph.D. student), priority-based resource management

Yingfeng Oh (Ph.D. student), real-time multiprocessor operating systems

Juhnyoung Lee (Ph.D. student), schedulers for real-time databases

Young-Kuk Kim (Ph.D. student), interface issues for real-time databases

Ambar Sarkar (Ph.D. student), real-time fault-tolerant network protocols

Prasad Wagle (M.S. student), dynamic priority scheduling

Yi Lin (M.S. student), priority-based contention protocols

Carmen Iannacone (M.S. student), multi-thread real-time database server

Robert Beckinger (M.S. student), support for temporal information

Spiros Kouloumbis (M.S. student), replication control

Savita Shamsunder (M.S. student), optimistic concurrency control protocols

Stavros Yannopolous (M.S. student), experimental database manager

Fengjie Zhang (M.S. student), majority consensus for real-time databases

David Bake (M.S. student), distributed real-time transaction processing

Richard McDaniel (B.S. student), prototyping environment

•  Number  and  Types  of  Degrees  Granted 
M.S.  degrees  granted  to  6  students. 


Concurrency Control in Real-Time Databases
by  Dynamic  Adjustment  of  Serialization  Order 


Yi  Lin  and  Sang  H.  Son 

Department  of  Computer  Science 
University  of  Virginia 
Charlottesville,  Virginia  22903 


ABSTRACT 

Time-critical scheduling in real-time database
systems has two components: real-time transaction
scheduling, which is related to task scheduling in
real-time operating systems, and concurrency control,
which can be considered as operation level scheduling.
Most current research in this area focuses only on the
transaction scheduling aspect while the concurrency
control part remains untouched. In this paper, a new
concurrency control algorithm for real-time database
systems is proposed, by which real-time scheduling and
concurrency control can be integrated. The algorithm
is based on a priority-based locking mechanism to
support time-critical scheduling by adjusting the
serialization order dynamically in favor of high priority
transactions. Furthermore, it does not assume any knowledge
about the data requirements or execution time of each
transaction. This makes the algorithm very practical.


1.  Introduction 

Compared with traditional databases, the ability
to meet the deadlines of transactions is vital to a
real-time database. In other words, the timeliness of results
can be as important as their correctness in real-time
database systems. Deadlines constitute the timing
constraints of transactions. A deadline is said to be hard if
it cannot be missed or else the result is useless. If a
deadline can be missed, it is a soft deadline. With soft
deadlines, the usefulness of a result may decrease after
the deadline is missed. Another important characteristic
of real-time transactions is criticality, which
represents the importance of a transaction. Deadline
and criticality are two orthogonal factors that play
important roles in real-time database systems [Son88].
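
The difference between hard and soft deadlines can be illustrated with a small value function. This is only a sketch: the name `result_value`, the linear decay, and the `grace` parameter are assumptions made for the example, since the text says only that usefulness "may decrease" after a soft deadline.

```python
def result_value(finish_time, deadline, hard, grace=10.0):
    """Return the usefulness (0.0 to 1.0) of a result produced at finish_time.

    Hypothetical model: a late result is worthless under a hard deadline;
    under a soft deadline its value decays linearly to zero over `grace`
    time units (the decay shape is an assumption of this sketch).
    """
    if finish_time <= deadline:
        return 1.0
    if hard:
        return 0.0
    lateness = finish_time - deadline
    return max(0.0, 1.0 - lateness / grace)
```

Criticality would be a separate attribute of the transaction; it does not enter this function, which reflects the statement that deadline and criticality are orthogonal.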

The goal of scheduling in real-time database
systems is twofold: to meet timing constraints and to
enforce data consistency. In real-time operating
systems, scheduling is usually at the task level subject only
to timing constraints. Data consistency constraints are
not involved. In conventional database systems,
meeting the deadline is typically not addressed. Scheduling
is at the operation level in that the basic unit of a schedule
is an operation. The only goal of such scheduling is to
guarantee data consistency. Scheduling in real-time
database systems is a combination of the two
scheduling mechanisms [Son90c].

Real-time task scheduling methods can be
extended for real-time transaction scheduling while
concurrency control protocols are still needed for
operation scheduling to maintain data consistency.
However, the integration of the two mechanisms in
real-time database systems is not trivial. The general
approach is to utilize existing concurrency control
protocols, especially 2PL, and to apply time-critical
transaction scheduling methods that favor more urgent
transactions [Abb88, Sha90, Son89b]. Such
approaches have the inherent disadvantage of being
limited by the concurrency control method upon which
they are based, since all existing concurrency control
methods synchronize concurrent data access of
transactions by the combination of two measures: blocking
and roll-backs of transactions. Both are barriers to
time-critical scheduling. The conservative two-phase
locking (2PL) protocol [Bern87] and the optimistic
methods [Bok87, Kung81] are examples of the two
extremes. In real-time database systems, blocking may
cause priority inversion when a high priority
transaction is blocked by lower priority transactions [Sha88].
The alternative is to abort the low priority transactions
if they block a high priority transaction. This wastes
the work done by the aborted transactions and in turn
also has a negative effect on time-critical scheduling.

Concurrency control protocols induce a
serialization order among conflicting transactions. In
non-real-time concurrency control protocols, timing constraints
are not a factor in the construction of this order. This is
obviously a drawback for real-time database systems.
For example, with the 2PL method, the serialization
order is dynamically constructed and corresponds to
the order in which the conflicting transactions access
shared data objects. In other words, the serialization
order is bound to the past execution history with no
flexibility. When a transaction T_H with a higher
priority requests an exclusive lock which is being held by
another transaction, T_L, with a lower priority, the only
choices are either aborting T_L or letting T_H wait for T_L.
Neither choice is satisfactory and thus the performance
is degraded.

Based on the argument that timing constraints
may be more important than data consistency in
real-time database systems, attempts have been made to
satisfy the timing constraints by sacrificing database
consistency temporarily to some degree [Kor90, Lin89,
Vrb88]. This approach is based on a new consistency model of
real-time databases, in which maintaining external data
consistency has priority over maintaining internal data
consistency. Although in some applications weaker
consistency is acceptable [Gar83], a general-purpose
consistency criterion that is less stringent than
serializability has not yet been proposed. The problem is that
temporary inconsistencies may affect active
transactions and so the commitment of these transactions may
still need to be delayed until the inconsistencies are
removed; otherwise even committed transactions may
need to be rolled back. However, in real-time systems,
some actions are not reversible. In addition, incorrect
data may spread within the database. This makes
inconsistency removal a very difficult task. Before any
breakthrough is made in this direction, serializability
seems to be the only correctness criterion for us to live
with.

Satisfying the timing constraints while
preserving data consistency requires the concurrency control
algorithms to accommodate timeliness of transactions
as well as to maintain data consistency. This is the
very goal of our work. In real-time database systems,
timeliness of a transaction is usually combined with its
criticality to take the form of the priority of that
transaction. Various ways of assigning priority and their
effects have been discussed in [Stan88, Hua89,
Buch89].

For a concurrency control algorithm to
accommodate the timeliness of transactions, the serialization
order it produces should reflect the priority of
transactions [Son90b]. However, this is often hindered by the
past execution history of transactions. For example, a
higher priority transaction may have no way to precede
a lower priority transaction in the serialization order
due to previous conflicts. The result is that either the
lower priority transaction has to be aborted or the high
priority transaction suffers blocking. If the information
about data requirements and execution time of each
transaction is available beforehand, off-line preanalysis
can be performed to avoid conflicts [Sha90]. This is
exactly what is done in many real-time task scheduling
protocols. However, such an approach may have to delay
the starting of some transactions, even if they have high
priorities, and may reduce the concurrency level in the
system. This, in turn, may lead to the violation of the
timing constraints and degrade the system performance
[Son90].

What we need is a concurrency control algorithm
that allows transactions to meet the timing constraints
as much as possible without reducing the concurrency
level of the system in the absence of any a priori
information [Son89]. The algorithm presented in this paper
features such ability. It has the flavor of both locking
and optimistic methods. Transactions write into the
database only after they are committed. By using a
priority-dependent locking protocol, the serialization
order of active transactions is adjusted dynamically,
making it possible for transactions with higher
priorities to be executed first so that higher priority
transactions are never blocked by uncommitted lower priority
transactions, while lower priority transactions may not
have to be aborted even in the face of conflicting
operations. The adjustment of the serialization order can be
viewed as a mechanism to support time-critical
scheduling. For example, suppose T1 and T2 are two
transactions, with T1 having the higher priority. T2 writes a data
object x before T1 reads it. In 2PL, even in the absence
of any other conflicting operations between these two
transactions, T1 has to either abort T2 or be blocked
until T2 releases the write lock. That is because the
serialization order T2 -> T1 is already determined by the
past execution history. T1 can never precede T2 in the
serialization order. In our method, when such a conflict
occurs, the serialization order of the two transactions
will be adjusted in favor of T1, i.e., T1 -> T2, and
neither is T1 blocked nor is T2 aborted. In addition, the
locking protocol in the algorithm is free from
deadlocks.

The remainder of this paper is organized as
follows. The details of the proposed algorithm are
described in the next section. The transaction
scheduling aspect of the algorithm is discussed in Section 3.
Section 4 presents an informal argument on the
correctness of the method. Section 5 presents an example to
show how the algorithm works. Finally, concluding
remarks appear in Section 6.

2.  The  Proposed  Algorithm 

The environment we assume for the implementation
is a single processor with randomly arriving
transactions. Each transaction is assigned an initial priority
and a start-timestamp when it is submitted to the
system. The initial priority can be based on the deadline
and the criticality of the transaction. The
start-timestamp is appended to the initial priority to form the
actual priority that is used in scheduling. When we
refer to the priority of a transaction, we always mean
the actual priority with the start-timestamp appended.
Since the start-timestamp is unique, so is the priority of
each transaction. The priority of transactions with the
same initial priority is distinguished by their
start-timestamps.
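
One way to picture the actual priority is as a pair that is compared lexicographically. The sketch below is illustrative only: the class name, the counter used as a timestamp source, the convention that a larger initial priority is more urgent, and the choice that the earlier arrival wins a tie are all assumptions, since the text states only that the start-timestamp makes each priority unique.

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical source of unique, increasing start-timestamps.
_start_counter = itertools.count(1)


@dataclass
class Transaction:
    name: str
    initial_priority: int  # assumption: larger value means more urgent
    start_ts: int = field(default_factory=lambda: next(_start_counter))

    @property
    def priority(self):
        # Actual priority = initial priority with the start-timestamp appended.
        # Negating the timestamp lets the earlier arrival win a tie
        # (a tie-breaking choice made for this sketch).
        return (self.initial_priority, -self.start_ts)
```

Because tuples compare element by element, two transactions with the same initial priority still receive distinct, totally ordered priorities.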

All transactions that can be scheduled are placed
in a ready queue, R_Q. Only transactions in R_Q are
scheduled for execution. When a transaction is
blocked, it is removed from R_Q. When a transaction
is unblocked, it is inserted into R_Q again, but may still
be waiting to be assigned the CPU. A transaction is
said to be suspended when it is not executing, but still
in R_Q. When a transaction is doing an I/O operation, it is
blocked. Once the operation completes, the transaction
is usually unblocked. We will discuss the CPU scheduling
policy in the next section.
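
The blocked, suspended, and executing states can be sketched as follows. The class and method names are hypothetical, and the dispatch rule (run the highest-priority ready transaction) is an assumption of the sketch, because the actual CPU scheduling policy is deferred to the next section.

```python
class ReadyQueue:
    """Hypothetical R_Q: holds every transaction that may be scheduled."""

    def __init__(self):
        self._ready = {}      # transaction name -> priority
        self.running = None   # name of the transaction holding the CPU

    def unblock(self, name, priority):
        # An unblocked (or newly arrived) transaction enters R_Q;
        # it may still have to wait for the CPU.
        self._ready[name] = priority

    def block(self, name):
        # A blocked transaction (for example, one doing I/O) leaves R_Q.
        self._ready.pop(name, None)
        if self.running == name:
            self.running = None

    def dispatch(self):
        # Assumed policy: run the highest-priority transaction in R_Q.
        if self._ready:
            self.running = max(self._ready, key=self._ready.get)
        else:
            self.running = None
        return self.running

    def suspended(self):
        # Suspended = still in R_Q but not currently executing.
        return sorted(n for n in self._ready if n != self.running)
```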

The execution of each transaction is divided into
three phases: the read phase, the wait phase and the
write phase. This is similar to the optimistic methods.
During the read phase, a transaction reads from the
database and writes to its local workspace. After it
completes, it waits for its chance to commit in the wait
phase. If it is committed, it switches into the write
phase during which all its updates are made permanent
in the database. A transaction in any of the three phases
is called active. If an active transaction is in the write
phase, then it is committed and writing into the
database. The proposed algorithm takes an approach of
integrated schedulers in that it uses 2PL for read-write
conflicts and Thomas' Write Rule (TWR) for
write-write conflicts. The following is the outline of a
transaction:

transaction = {  tbegin();
                 read phase;
                 twait();
                 twrite();
              }

All  the  procedures  will  be  defined  later  in  this  section. 
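
As a rough picture of how the three phases fit together, the toy class below follows the outline above. It is a sketch under stated assumptions, not the procedures defined later in the paper: the method bodies, the `may_commit` callback that stands in for the commit test of the wait phase, and the dictionary used as the database are all hypothetical.

```python
class SketchTransaction:
    """Toy transaction that follows the read / wait / write outline."""

    def __init__(self, database, updates):
        self.db = database        # shared dict standing in for the database
        self.updates = updates    # values this transaction wants to write
        self.workspace = {}       # private copies produced during the read phase
        self.trace = []           # phases entered, in order

    def tbegin(self):
        self.trace.append("read")

    def read_phase(self):
        # Writes go to the local workspace, never to the database.
        for key, value in self.updates.items():
            self.workspace[key] = value

    def twait(self, may_commit=lambda: True):
        # Assumption: the caller supplies the commit test; the real rule
        # belongs to the algorithm, not to this sketch.
        self.trace.append("wait")
        return may_commit()

    def twrite(self):
        self.trace.append("write")
        self.db.update(self.workspace)  # updates become permanent here
```

A driver would call `tbegin()`, `read_phase()`, `twait()`, and, if the wait phase grants the commit, `twrite()`, in that order.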

In  our  algorithm,  there  are  various  data  structures 
that  need  to  be  read  and  updated  in  a  consistent 
manner.  Therefore  we  use  critical  sections  of  various 
classes  to  group  the  various  data  structures  to  allow 
maximum  concurrency.  We  also  assume  that  each 
assignment  statement  of  global  data  is  executed  atomi¬ 
cally.  The  following  are  some  useful  notations: 


id:            id of this transaction
read_trset:    set of ids of transactions in the read phase
wait_trset:    set of ids of transactions in the wait phase
write_trset:   set of ids of transactions in the write phase
tscnt:         final-timestamp count of the system
ts:            final-timestamp of this transaction
ts(T):         final-timestamp value of transaction T
priority(T):   priority value of transaction T
r_i[x]:        transaction i reads data object x
w_i[x]:        transaction i writes data object x
pw_i[x]:       transaction i prewrites data object x
rlock(T,x):    transaction T holds a read lock on data object x
wlock(T,x):    transaction T holds a write lock on data object x
[...]:         begin critical section of class i
[...]:         end critical section of class i



2.1.  Read  Phase 

The read phase is the normal execution of the
transaction except that write operations are performed
on the private data copies in the local workspace of the
transaction instead of on the data objects in the
database. We call such write operations prewrites. One
advantage of this prewrite operation is that when a
transaction is aborted, all that has to be done for
recovery is to simply discard the data in its local
workspace. No rollback is needed because no changes
have been made in the database.
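
The effect of prewriting into a private workspace can be shown in a few lines. This is a minimal sketch that assumes a dictionary as the database; the class and method names are hypothetical and locking is left out entirely.

```python
class Workspace:
    """Hypothetical private workspace of one transaction."""

    def __init__(self, database):
        self.db = database
        self.private = {}

    def prewrite(self, key, value):
        # The write lands on a private copy; the database is untouched.
        self.private[key] = value

    def read(self, key):
        # The transaction sees its own prewrite first, otherwise the database.
        if key in self.private:
            return self.private[key]
        return self.db.get(key)

    def abort(self):
        # Recovery is just discarding the private copies; no rollback.
        self.private.clear()

    def commit(self):
        # Write phase: make the prewritten values permanent.
        self.db.update(self.private)
        self.private.clear()
```

After `abort()` the database is exactly as it was before the transaction started, which is why no undo log is needed in this scheme.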

The read-prewrite or prewrite-read conflicts
between active transactions are synchronized during
this phase by a priority-based locking protocol. Before
a transaction can perform a read (resp. prewrite)
operation on a data object, it must obtain the read (resp.
write) lock on that data object first. If a transaction
reads a data object that has been written by itself, it
gets the private copy in its own workspace immediately
and no read lock is needed. In the rest of the paper,
when we refer to read operations, we exclude such read
operations because they do not incur any dependency
among transactions.

Each lock contains the priority of the transaction
holding the lock as well as other usual information such
as the lock holder id and the lock type. The
locking protocol is based on the principle that higher
priority transactions should complete before lower priority
transactions. That means if two transactions conflict,
the higher priority transaction should precede the lower
priority transaction in the serialization order. With our
CPU scheduling policy, which will be described in the
next section, a high priority transaction is scheduled to
commit before a low priority transaction most of the
time. If a low priority transaction does complete before
a high priority transaction, it is required to wait until it
is sure that its commitment will not cause the abortion
of a higher priority transaction. Since transactions do
not write into the database during the read phase,
write-write conflicts need not be considered here.
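
A lock record carrying the holder's priority might look like the sketch below. The field names and the helper function are hypothetical; the text specifies only that a lock stores the holder's priority, the holder id, and the lock type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lock:
    holder_id: str     # id of the transaction holding the lock
    lock_type: str     # "read" or "write"
    priority: tuple    # priority of the holder, stored in the lock itself
    data_object: str   # name of the locked data object


def higher_priority_holders(locks, data_object, priority):
    """Return ids of holders on data_object whose priority exceeds `priority`."""
    return [lock.holder_id for lock in locks
            if lock.data_object == data_object and lock.priority > priority]
```

Keeping the priority inside the lock lets a requester compare itself with the current holders without looking up each holder's transaction record.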

Suppose active transaction T1 has higher priority
than active transaction T2. We have the following four
possibilities of conflict and the transaction
dependencies they set in the serialization order:

(1)  r_T1[x], pw_T2[x]  =>  T1 -> T2

(2)  pw_T1[x], r_T2[x]  =>  T1 -> T2  (delayed reading)
                        or  T2 -> T1  (immediate reading)

(3)  r_T2[x], pw_T1[x]  =>  T2 -> T1

(4)  pw_T2[x], r_T1[x]  =>  T1 -> T2  (immediate reading)
                        or  T2 -> T1  (delayed reading)



Case (1) meets the principle of completing high priority transactions before low priority ones. In case (2), following our principle, we should choose delayed reading, i.e. T2 should not read x until T1 has committed and written x in the database. Case (3) violates our principle. In this case, unless it is already committed, T2 is usually aborted because otherwise T2 must commit before T1 and thus will block T1. However, if T2 has already finished its work, i.e. it is in the wait phase, we should avoid aborting it because aborting a transaction which has completed its work imposes a considerable penalty on the system performance. In the meantime, we still do not want T1 to be blocked by T2. Therefore when such a conflict occurs and T2 is in the wait phase, we do not abort T2 until T1 is committed, hoping that T2 may get a chance to commit before T1 commits. In case (4), if T2 is already committed and in the write phase, we should delay T1 so that it reads x after T2 writes it. This blocking is not a serious problem for T1 because T2 is already in the write phase and is expected to finish writing x soon. T1 can read x as soon as T2 finishes writing x in the database, not necessarily after T2 completes the whole write phase. Therefore T1 will not be blocked for a long time. Otherwise, if T2 is not committed yet, i.e. it is either in the read phase or in the wait phase, T1 should read x immediately because that is in accordance with the principle.


As transactions are being executed and conflicting operations occur, all the information about the induced dependencies in the serialization order needs to be retained. To do this, we associate with each transaction two sets, before_trset and after_trset, and a count, before_cnt. The set before_trset (resp. after_trset) contains all the active lower priority transactions that must precede (resp. follow) this transaction in the serialization order. before_cnt is the number of the higher priority transactions that precede this transaction in the serialization order. When a conflict occurs between two transactions, their dependency is set and their values of before_trset, after_trset, and before_cnt will be changed correspondingly.
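The bookkeeping just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name Txn, the helper record_dependency, and the convention that a larger number means a higher priority are all hypothetical, and priorities are assumed to be distinct.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based hashing, so Txn objects can be stored in sets
class Txn:
    name: str
    priority: int  # assumption: larger value = higher priority
    before_trset: set = field(default_factory=set)  # lower-priority txns that must precede this one
    after_trset: set = field(default_factory=set)   # lower-priority txns that must follow this one
    before_cnt: int = 0  # number of higher-priority txns that precede this one


def record_dependency(first: Txn, second: Txn) -> None:
    """Record that `first` precedes `second` in the serialization order."""
    if first.priority > second.priority:
        # The higher-priority txn goes first: it remembers its follower,
        # and the follower counts one more predecessor it must wait for.
        if second not in first.after_trset:
            first.after_trset.add(second)
            second.before_cnt += 1
    else:
        # The lower-priority txn goes first: only the higher-priority txn records it.
        second.before_trset.add(first)
```

Note that both sets live on the higher priority transaction, while before_cnt lives on the lower priority one; recording the same dependency twice leaves the count unchanged.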

By  summarizing  what  we  discussed  above,  we 
define  the  locking  protocol  as  follows: 

LP1. Transaction T requests a read lock on data object x.

<0 <1
    for t ∈ {T_i | wlock(T_i, x) ∧ T_i ≠ T} do
        if (priority(t) > priority(T) ∨ t ∈ write_trset)   /* Case 2, 4 */
        then deny the lock and exit;
        endif
    enddo
    for t ∈ {T_i | wlock(T_i, x) ∧ T_i ≠ T} do   /* Case 4 */
        if (t ∈ before_trset_T) then abort t;
        else if (t ∉ after_trset_T)
            then
                after_trset_T := after_trset_T ∪ {t};
                before_cnt_t := before_cnt_t + 1;
            endif
        endif
    enddo
    grant the lock;
1> 0>


LP2. Transaction T requests a write lock on data object x.

<0 <2 <3
    for t ∈ {T_i | rlock(T_i, x) ∧ T_i ≠ T} do
        if (priority(t) > priority(T))
        then   /* Case 1 */
            if (T ∉ after_trset_t)
            then
                after_trset_t := after_trset_t ∪ {T};
                before_cnt_T := before_cnt_T + 1;
            endif
2> 3>
        else
            if (t ∈ wait_trset)   /* Case 3 */
            then
                if (t ∈ after_trset_T) then abort t;
                else
                    before_trset_T := before_trset_T ∪ {t};
                endif
            else if (t ∈ read_trset)
                then abort t;
                endif
            endif
        endif
    enddo
    grant the lock;
0>

The critical sections of class 0 guarantee that lock requests are processed sequentially, probably by a lock manager. LP1 and LP2 are actually two procedures of the lock manager that are executed when a lock is requested. When a lock is denied due to a conflicting lock, the request is suspended until that conflicting lock is released. Then the locking protocol is invoked once again from the very beginning to decide whether the lock can be granted now. Fig. 1 shows the lock compatibility tables in which the compatibilities are expressed by possible actions taken when conflicts occur. The compatibility depends on the priorities of the transactions holding and requesting the lock and the phase of the lock holder as well as the lock types. Even with the same lock types, different actions may be taken, depending on the priorities of the lock holder and the lock requester. Therefore a table entry may have more than one block reflecting the different possible actions.
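The two lock manager procedures LP1 and LP2 can be sketched in Python as follows. This is a simplified illustration, not the authors' code: critical sections are omitted (the functions are assumed to be called sequentially), aborting is reduced to setting a flag, the phase sets are replaced by a phase attribute, and the names Txn, request_read, and request_write are hypothetical. A larger priority number is assumed to mean a higher priority.

```python
class Txn:
    def __init__(self, name, priority):
        self.name, self.priority = name, priority  # assumption: larger = higher priority
        self.phase = "read"                        # "read", "wait", or "write"
        self.before_trset, self.after_trset = set(), set()
        self.before_cnt = 0
        self.aborted = False                       # stands in for calling the abort procedure


def request_read(T, writers):
    """LP1: T asks to read x; `writers` are the other txns holding write locks on x."""
    for t in writers:
        if t.priority > T.priority or t.phase == "write":   # Cases 2 and 4
            return False                                     # lock denied, T is suspended
    for t in writers:                                        # Case 4, immediate reading
        if t in T.before_trset:
            t.aborted = True          # t cannot both precede and follow T
        elif t not in T.after_trset:
            T.after_trset.add(t)
            t.before_cnt += 1
    return True


def request_write(T, readers):
    """LP2: T asks to write x; `readers` are the other txns holding read locks on x."""
    for t in readers:
        if t.priority > T.priority:                          # Case 1
            if T not in t.after_trset:
                t.after_trset.add(T)
                T.before_cnt += 1
        elif t.phase == "wait":                              # Case 3, t has finished its work
            if t in T.after_trset:
                t.aborted = True
            else:
                T.before_trset.add(t)
        elif t.phase == "read":                              # Case 3, t is still running
            t.aborted = True
    return True
```

In this sketch a read request is the only one that can be denied; a write request is always granted, possibly at the cost of aborting lower priority readers that are still in the read phase.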


[Fig. 1: two lock compatibility tables (lock held versus lock requested, each read or write), one for the case where the lock requester has lower priority and one for the case where the lock requester has higher priority. Legend: lock granted; lock requester blocked; lock requester aborted; lock holder aborted.]

Fig. 1 Lock Compatibility Table


Transactions  are  released  for  execution  as  soon 
as  they  arrive.  The  following  procedure  is  executed 
when  a  transaction  is  started: 

tbegin = (
    before_trset := ∅;
    after_trset := ∅;
    before_cnt := 0;
    read_trset := read_trset ∪ {id};
    R_Q := R_Q ∪ {id};
).

Then the transaction is in the read phase. When it tries to read or prewrite a data object, it requests the lock. The lock may be granted or not according to the locking protocol. Transactions may be aborted when lock requests are processed. To abort a transaction, the following procedure is called:

tabort = (
    release all locks;
<2
    for t ∈ after_trset do
        before_cnt_t := before_cnt_t - 1;
        if (before_cnt_t = 0 ∧ t ∈ wait_trset)
        then unblock t;
        endif
    enddo
2>
    if (id ∈ read_trset)
    then read_trset := read_trset - {id};
    else if (id ∈ write_trset)
        then write_trset := write_trset - {id};
        else if (id ∈ wait_trset)
            then wait_trset := wait_trset - {id};
            endif
        endif
    endif
).
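The effect of the abort procedure on the dependency counters can be illustrated with a short Python sketch. It is a simplification under stated assumptions: lock release and the critical section are left out, the three phase sets are passed in as a dictionary, and the names Txn, abort, and phase_sets are hypothetical.

```python
class Txn:
    def __init__(self, name):
        self.name = name
        self.after_trset = set()   # lower-priority txns that must follow this one
        self.before_cnt = 0        # higher-priority predecessors still active


def abort(T, phase_sets):
    """Abort T and return the waiting followers that may now try to commit.

    `phase_sets` maps "read", "wait", and "write" to the sets of txns in that phase.
    """
    unblocked = []
    for t in T.after_trset:
        t.before_cnt -= 1                       # T no longer precedes t
        if t.before_cnt == 0 and t in phase_sets["wait"]:
            unblocked.append(t)                 # t was only waiting for T
    for members in phase_sets.values():
        members.discard(T)                      # T leaves whichever phase set held it
    return unblocked
```

A follower is unblocked only if the aborted transaction was its last outstanding predecessor and it has already reached the wait phase.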


Note that a data object may be both read locked and write locked by several transactions simultaneously with our locking protocol. Unlike 2PL, locks are not classified simply as shared locks and exclusive locks. Fig. 2 summarizes the lock compatibility of 2PL with the High Priority scheme in which high priority transactions never block for a lock held by a low priority transaction [Abb88]. Comparing Fig. 1 with Fig. 2 shows that our locking protocol is much more flexible, and thus incurs less blocking and fewer aborts. Note that in Fig. 1, the abort of lower priority transactions in the wait phase is also included. In our locking protocol, a high priority transaction is never blocked or aborted due to conflict with an uncommitted lower priority transaction. The probability of aborting a lower priority transaction should be less than that in 2PL under the same conditions. An analytical model may be used to estimate the exact probability, but that is beyond the scope of this paper.


[Fig. 2: lock compatibility tables for 2PL with the High Priority scheme, one panel for a lock requester with lower priority and one for a lock requester with higher priority.]

Fig. 2 Lock Compatibility Table of 2PL

The critical section of class 2 in the procedure tabort also appears in LP2. This ensures the mutual exclusion on after_trset. To be precise, mutual exclusion is only needed between LP2 and the procedure. Transactions can be in the critical section of the procedure simultaneously, because each transaction in the procedure tabort only accesses its own after_trset.

2.2.  Wait  Phase 

The wait phase allows a transaction to wait until it can commit. A transaction T can commit only if all transactions with higher priorities that must precede it in the serialization order are either committed or aborted. Since before_cnt is the number of such transactions, T can commit only if its before_cnt becomes zero. A transaction in the wait phase may be aborted for two reasons. The first is that since T is not committed yet and is still holding all its locks, by the locking protocol it may be aborted due to a conflicting lock request by a higher priority transaction. The second is the commitment of a higher priority transaction that must follow T in the serialization order. When such a transaction commits, it finds T in its before_trset and aborts T. Once a transaction in the wait phase gets its chance to commit, i.e. its before_cnt goes to zero, it switches into the write phase and releases all its read locks. A final-timestamp is assigned to it, which is the absolute serialization order. The procedure is as follows:

twait = (
    wait_trset := wait_trset ∪ {id};
    read_trset := read_trset - {id};
    waiting := TRUE;
    while (waiting) do
<1
        if (before_cnt = 0)   /* if can commit */
        then   /* switching into write phase */
            wait_trset := wait_trset - {id};
            write_trset := write_trset ∪ {id};
            ts := tscnt;
            tscnt := tscnt + 1;
            for t ∈ before_trset do
                if (t ∈ read_trset ∨ t ∈ wait_trset)
                then abort t;
                endif
            enddo
1>
            waiting := FALSE;
        else block;
        endif
    enddo
    release all read locks;
<3
    for t ∈ after_trset do
        if (t ∈ read_trset ∨ t ∈ wait_trset)
        then before_cnt_t := before_cnt_t - 1;
            if (before_cnt_t = 0 ∧ t ∈ wait_trset)
            then unblock t;
            endif
        endif
    enddo
3>
).

After a transaction commits, all the transactions in its before_trset need to be aborted because they must commit, if they can, before this transaction. The critical section of class 1 in the procedure guarantees that transactions cannot switch into the write phase concurrently, and once a transaction is committed and assigned a final-timestamp, no transaction in its before_trset can commit. Note that LP1 is also in the critical section of the same class. This achieves mutual exclusion on before_cnt and write_trset. The critical section of class 3 in the procedure has the same effect as that of the critical section in the procedure tabort.
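The commit rule of the wait phase can be sketched in Python. This is a single-threaded illustration under stated assumptions, not the authors' procedure: blocking and critical sections are replaced by a function that either commits or reports that the transaction must keep waiting, lock handling is omitted, and the names Txn, Committer, and try_commit are hypothetical.

```python
class Txn:
    def __init__(self, name):
        self.name = name
        self.phase = "wait"          # "read", "wait", "write", or "aborted"
        self.before_trset, self.after_trset = set(), set()
        self.before_cnt = 0
        self.ts = None               # final-timestamp, assigned at commit


class Committer:
    def __init__(self):
        self.tscnt = 0               # global counter for final-timestamps

    def try_commit(self, T):
        """Return True if T switches into the write phase, False if it must keep waiting."""
        if T.before_cnt != 0:
            return False             # a higher-priority predecessor is still active
        T.phase = "write"
        T.ts, self.tscnt = self.tscnt, self.tscnt + 1
        for t in T.before_trset:     # these had to commit before T and have not done so
            if t.phase in ("read", "wait"):
                t.phase = "aborted"
        for t in T.after_trset:      # T no longer holds its followers back
            if t.phase in ("read", "wait"):
                t.before_cnt -= 1
        return True
```

Final-timestamps are handed out in commit order, so the timestamp order of committed transactions matches the order in which they entered the write phase.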

2.3.  Write  Phase 

Once a transaction is in the write phase, it is considered to be committed. All committed transactions can be serialized by the final-timestamp order. In the write phase, the only work of a transaction is making all its updates permanent in the database. Data items are copied from the local workspace into the database. After each write operation, the corresponding write lock is released. Thomas' Write Rule (TWR) is applied here. The write requests of each transaction are sent to the data manager, which carries out the write operations in the database. Transactions submit write requests along with their final-timestamps. The write procedure is as follows:

twrite = (
<4
    for x ∈ {x_i | wlock(id, x_i)} do
        for T ∈ write_trset do
            if (wlock(T, x) ∧ ts(T) < ts(id))
            then release T's write lock on x;
            endif
        enddo
        send write request on x and wait for acknowledgement;
4>
        if (acknowledgement is ok)
        then release the write lock on x;
        else abort;
        endif
    enddo
    R_Q := R_Q - {id};
).

The purpose of the critical section is to achieve mutual exclusion on write locks. For each data object, write requests are sent to the data manager only in ascending timestamp order. After a write request on data object x with timestamp n is issued to the data manager, no other write request on x with a timestamp smaller than n will be sent. The write requests are buffered by the data manager. The data manager can work with the first-come-first-serve policy or always select the write request with the highest priority to process. When a new request arrives, if there is another buffered write request on the same data object, the request with the smaller timestamp is discarded. Therefore for each data object there is at most one write request in the buffer. This, in conjunction with the procedure twrite, guarantees TWR.
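The buffering rule of the data manager can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the authors' design: the class DataManager and its methods submit and process_all are hypothetical names, acknowledgements are omitted, and the order in which buffered requests are served is not modeled.

```python
class DataManager:
    def __init__(self):
        self.buffer = {}      # data object -> (timestamp, value), at most one entry per object
        self.database = {}    # data object -> value

    def submit(self, obj, ts, value):
        """Buffer a write request, keeping only the request with the larger timestamp."""
        pending = self.buffer.get(obj)
        if pending is None or pending[0] < ts:
            self.buffer[obj] = (ts, value)   # any older buffered request is discarded
        # otherwise the new request carries the smaller timestamp and is dropped

    def process_all(self):
        """Apply every buffered request to the database, then empty the buffer."""
        for obj, (_, value) in self.buffer.items():
            self.database[obj] = value
        self.buffer.clear()
```

The buffer alone does not stop a write with a smaller timestamp from arriving after a larger one has already been processed; in the text that case is excluded by twrite, which releases the write locks of older transactions before sending a request.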

3.  CPU  Scheduling 

Although the focal point of this paper is on concurrency control, i.e. operation level scheduling, we still need to briefly discuss the transaction scheduling, or CPU scheduling, aspect of our algorithm. In non-real-time database systems, CPU scheduling is usually done by the underlying operating system, because there are no timing constraints. Data consistency is the only concern. In real-time database systems, however, CPU scheduling should take into account the timeliness of transactions.

In our protocol, R_Q contains all transactions that can be scheduled. These transactions can be in any phase. We need a policy to determine the CPU scheduling priority for transactions in different phases. Transactions in their wait phase are those that have finished their work and are waiting for their chance to commit. We would like to avoid aborting such transactions as much as possible. Therefore transactions in this phase are given higher CPU scheduling priority than those in the read phase so that they can commit as soon as they get the chance. Transactions in the read phase are scheduled according to their assigned priority. If there are several read phase transactions in the R_Q, the one with the highest priority is always selected to execute.

For transactions in the wait phase, the lower the priority is, the higher the CPU scheduling priority is. Since low priority transactions are more vulnerable to conflicts, if there is a chance, they should be committed as soon as possible to avoid being aborted later. Moreover, when a high priority transaction TH is committed, it may have to abort a low priority transaction TL if TL is in TH's before_trset. If TL is also ready to commit and we allow it to commit before TH, both TL and TH can be committed.
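The scheduling policy of this section can be written as a small selection function. This is a sketch under stated assumptions: ready transactions are represented as (name, phase, priority) tuples, a larger number means a higher priority, and the function name pick_next is hypothetical. The text does not say where write phase transactions fall in the ordering, so the sketch ignores them.

```python
def pick_next(ready):
    """Choose the next transaction to run from `ready`, a list of (name, phase, priority)."""
    waiting = [t for t in ready if t[1] == "wait"]
    if waiting:
        # Wait-phase transactions come first; among them the lowest priority runs first.
        return min(waiting, key=lambda t: t[2])
    reading = [t for t in ready if t[1] == "read"]
    # Read-phase transactions are ordered by their assigned priority.
    return max(reading, key=lambda t: t[2]) if reading else None
```

For example, with a read phase transaction of priority 3 and wait phase transactions of priorities 2 and 1 in the queue, the wait phase transaction of priority 1 is selected.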

4.  Correctness  of  the  Algorithm 

In this section, we give an informal argument on the correctness of the algorithm. We will also show that the algorithm is free from deadlocks. First, we give the simple definitions of history and serialization graph (SG). For the formal definitions, see [Bern87]. A history is a partial order of operations that represents the execution of a set of transactions. Any two conflicting operations must be comparable. Let H be a history. The serialization graph for H, denoted by SG(H), is a directed graph whose nodes are committed transactions in H and whose edges are all Ti → Tj (i ≠ j) such that one of Ti's operations precedes and conflicts with one of Tj's operations in H. To prove a history H serializable, we only have to prove that SG(H) is acyclic [Bern87].
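Checking that a serialization graph is acyclic is a standard graph computation. The sketch below uses Kahn's topological-sort algorithm, which is one common way to do it and is not taken from the source; the function name is_acyclic and the representation of the graph as a node list plus (source, destination) edge pairs are assumptions.

```python
def is_acyclic(nodes, edges):
    """Return True if the directed graph (nodes, edges) has no cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    frontier = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while frontier:
        n = frontier.pop()
        removed += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                frontier.append(m)
    # Every node can be removed exactly when the graph contains no cycle.
    return removed == len(nodes)
```

A history whose graph has the edges T1 → T2 and T2 → T3 passes the check, while adding T2 → T1 to a graph containing T1 → T2 makes it fail.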

Let T1 and T2 be two committed transactions in a history H produced by the algorithm. We argue that if there is an edge T1 → T2 in SG(H), then ts(T1) < ts(T2). Since T1 → T2, the two must have conflicting operations. There are three cases.

Case 1: w1[x] → w2[x]

Suppose ts(T2) < ts(T1). Therefore T2 enters into the write phase before T1. If w1[x] is sent to the data manager first, T2's write lock on x must be released before w1[x] is sent to the data manager (in twrite). If w2[x] is sent to the data manager first, it will either be processed before w1[x] is sent to the data manager, or be discarded when the data manager receives w1[x], because w2[x] has a smaller timestamp. Therefore w1[x] is never processed before w2[x]. Such a conflict is impossible. A contradiction.

Case 2: r1[x] → w2[x]

If T2 holds a write lock on x when T1 requests the read lock, we must have priority(T1) > priority(T2) and T2 is not in the write phase, because otherwise T1 would have been blocked by LP1. By LP1, T2 ∈ after_trset_T1. T2 will not switch into the write phase before T1 does, because before_cnt_T2 cannot be zero with T1 still in the read or wait phase. Therefore ts(T1) < ts(T2). If T1 holds a read lock on x when T2 requests the write lock, by LP2, we have either T2 ∈ after_trset_T1 or T1 ∈ before_trset_T2, depending on the priorities of the two transactions. In either case, T1 must commit before T2. Hence we also have ts(T1) < ts(T2).

Case 3: w1[x] → r2[x]

Since T1 is already in the write phase before T2 reads x, we must have ts(T1) < ts(T2).

Suppose there is a cycle T1 → T2 → ... → Tn → T1 in SG(H). By the above argument, we have ts(T1) < ts(T2) < ... < ts(Tn) < ts(T1). This is impossible. Therefore no cycle can exist in SG(H) and thus the algorithm only produces serializable histories.

In the algorithm, a high priority transaction can be blocked by a low priority transaction only if the low priority transaction is in the write phase. Suppose there is a cycle in the wait-for graph (WFG), T1 → T2 → ... → Tn → T1. For any edge Ti → Tj in the cycle, if priority(Ti) > priority(Tj), Tj must be in the write phase, thus it cannot be blocked by any other transaction and cannot appear in the cycle. Therefore we must have priority(Ti) < priority(Tj) and thus priority(T1) < priority(T2) < ... < priority(Tn) < priority(T1). This is impossible. Hence a deadlock cannot exist.

The strictness of the histories produced by the algorithm follows from the fact that a transaction applies the results of its write operations from its local workspace into the database only after it commits. This property makes the transaction recovery procedure simpler than in non-strict concurrency control algorithms.

5.  An  Example 

In this section, we give a simple example to show how the algorithm works. The example is depicted in Fig. 3. A solid line at a low level indicates that the corresponding transaction is doing an I/O operation due to a page fault or is in the write phase. A dotted line at a low level indicates that the corresponding transaction is either suspended or blocked, and is not doing any I/O operation. A line raised to a higher level indicates that the transaction is executing. The absence of a line indicates that the transaction has not yet arrived or has already completed.

Fig. 3 An Example


There are three transactions in the example. T1 has the highest priority and T3 has the lowest. T3 arrives at time t0 and reads data object a. This causes a page fault. After the I/O operation, it pre-writes b. Then T2 comes in at time t1 and preempts T3. At time t2 it reads c and causes another page fault. So it is blocked for the I/O operation and T3 executes. After T3 pre-writes d, T2 finishes I/O and preempts T3 again. It pre-writes d, which is only write locked by T3. At time t3, T1 arrives and preempts T2. T1 first reads d, which is write locked by both T2 and T3. Therefore, by LP1, after_trset_T1 becomes {T2, T3} and both before_cnt_T2 and before_cnt_T3 become 1. Then T1 reads b, which is write locked by T3. Since T3 is already in after_trset_T1, nothing is changed. Then T1 pre-writes b and pre-writes d. Since these two data objects are not read locked by any other transactions, the write locks are granted to T1 directly. At time t4, T1 switches into the write phase. Both before_cnt_T2 and before_cnt_T3 go back to 0. Now T2 should be executed, but it needs to read b, which is being write locked by T1; hence T3 is executed instead. It reads c, which is read locked by T2. At time t5, T1 finishes writing b and releases the write lock so that T2 can preempt T3 to continue its work. It reads b, which is write locked by T3. Now after_trset_T2 becomes {T3} and before_cnt_T3 becomes 1. After T2 pre-writes b, it switches into the write phase and before_cnt_T3 becomes 0 again. Then T3 executes and also switches into the write phase after pre-writing c.

In the above example, T1, which is supposed to be the most urgent transaction, finishes first although it is the last to arrive. T3, which is supposed to be the least urgent one, is the last one to commit. None of the three transactions needs to be aborted. Assume we use 2PL in the above example. When a high priority transaction requests a lock which is held by a low priority transaction, we either let the high priority transaction wait or abort the low priority transaction. If we choose the first alternative, both T1 and T2 would be blocked by T3 because T3 holds a write lock on d. If we choose the second alternative, T3 will be aborted by T2 when T2 pre-writes d and then T2 will be aborted by T1 when T1 reads d. This example illustrates the advantage of the proposed method over 2PL.

6.  Conclusions 

Time-critical  scheduling  in  real-time  database 
systems  consists  of  two  scheduling  mechanisms:  tran¬ 
saction  scheduling  and  operation  scheduling.  To  find 
new  concurrency  control  methods  in  which  timing  con¬ 
straints  of  transactions  are  taken  into  account,  we  have 
investigated  solutions  to  the  operation  scheduling 
aspect  of  time-critical  scheduling. 

In this paper, a priority-based concurrency control method for real-time database systems is presented which employs a priority-dependent locking mechanism. It works under the condition that no information about data requirements or execution time of each transaction is available. By delaying the write operations of transactions, the restraint of past transaction execution on the serialization order is relaxed, allowing the serialization order among transactions to be adjusted dynamically in compliance with transaction timeliness and criticality. The new algorithm allows transactions to meet their timing constraints as much as possible without reducing the concurrency level of the system or increasing the restart rate significantly. In the algorithm, high priority transactions are never blocked by an uncommitted lower priority transaction, while low priority transactions may not have to be aborted even in the face of conflict with high priority transactions. In conjunction with the time-critical transaction scheduling policy (CPU scheduling policy) discussed in Section 3, the proposed algorithm is expected to improve the system performance significantly.

1. Introduction

Real-time database systems (RTDBS) are database systems where transactions have timing constraints such as deadlines. The correctness of the system depends not only on the logical results but also on the time within which the results are produced. In RTDBS, transactions must be scheduled in such a way that they can be completed before their corresponding deadlines expire. For example, both the update and query in the tracking data for a mission must be processed within given deadlines.

Conventional database systems are typically not used in real-time applications due to poor performance and lack of predictability. In other words, paradigms used in conventional database systems are not suitable in real-time database systems [Son90]. To address this problem, we have been investigating new database technology and paradigms for real-time systems using both theoretical and experimental approaches. They can be grouped into the following research tasks: (1) investigating new protocols for transaction scheduling, concurrency control, and checkpointing, and (2) developing experimental database systems that can provide real-time features over conventional relational databases. New scheduling and concurrency control protocols developed in the first task are being implemented in the experimental database systems and the prototyping environment for performance evaluation.

Our research effort in the area of real-time transaction scheduling has resulted in two new protocols: one based on locking [Lin90] and the other on timestamp ordering. In the area of experimental database systems, we have been developing a suite of database systems on several platforms. Currently, our research utilizes the UNIX, StarLite [Cook90], and ARTS operating systems [Tok89]. Experimental database systems we have developed on these platforms are the Multi-user Real-time Database (MRDB), Parallel Real-time Database (PRDB), and Real Time Database (RTDB), respectively [Son91]. All three systems are based on the relational paradigm. Much of our development consists of implementing new functionality on the most appropriate platform, and where applicable, porting the result to one of the others. In this paper, we outline the scheduling protocol based on timestamp ordering and our experience with PRDB development.

2.  An  Optimistic  Concurrency  Control  for  Real-Time  Transaction  Scheduling 

In real-time transaction scheduling, the actual execution order of operations is determined by two factors: priority order and serialization order among transactions in the system. The difficulties in real-time transaction scheduling arise from the fact that these two factors have different natures and are constructed in different ways. While serializable execution order is strictly bound to the past execution history, the priority order does not reflect the past execution history and may dynamically destroy the order set up in the past execution, hence serializability. By identifying the effects of the interactions between serialization order and priority order in scheduling real-time transactions, we can build more intelligent conflict resolution schedulers.

One approach to real-time transaction scheduling is to make the priority order and serialization order compatible as much as possible in order to increase the probability of satisfying both timing and consistency constraints. One way to make the two orders compatible is to adjust serialization order dynamically to priority order. This approach can be justified because serialization order is not subject to timing constraints as long as it enforces serializability, while we assume that the priority order of a transaction is statically determined when it arrives in the system.

Integrating a concurrency control protocol with priority-based scheduling methods has the inherent disadvantage of being limited by the concurrency control protocol on which it depends. Two-phase locking and timestamp ordering depend on the immediate validation of operations, and do not provide a facility to adjust serialization order dynamically to priority order. To adjust the serialization order, we need to delay determining the serialization order of conflicting operations, because once the serialization order is determined, the orders of operations from transactions cannot be adjusted dynamically.

In optimistic concurrency control, in which the serializability test (called the validation test) is made only at the end of a transaction, the serialization order can be constructed dynamically in compliance with transaction timeliness and criticality. Furthermore, owing to its potential for a high degree of parallelism, optimistic concurrency control is expected to perform better than two-phase locking or timestamp ordering in real-time transaction scheduling.

We have developed an optimistic concurrency control protocol based on the notion of dynamic timestamp allocation [Bok87]. In this protocol, the serialization order is dynamically constructed by using intervals of timestamps. The protocol uses a backward validation scheme, in which validating a transaction is performed against committed transactions. It also updates the timestamp intervals of active transactions to adjust their serialization order. As in other optimistic protocols, the execution of a transaction in our protocol is divided into three phases: read, validation, and write. However, unlike other optimistic protocols, conflicts and nonserializable executions are detected during the read phase of transaction execution, minimizing wasted work due to later restarts of transactions.

The goal of this protocol is to enforce serializability by satisfying the following two conditions, (C1)
and (C2), through every read, prewrite, and validation. As long as (C1) and (C2) are satisfied, the
serialization order can be adjusted in favor of priority order without violating data consistency.

(C1) Each timestamp interval constructed when a transaction accesses a data object should preserve the
order induced by the timestamps of all committed transactions which have accessed that data
object.

(C2) The order induced by the timestamp values of a validating transaction should not destroy the
serialization order constructed by the past execution, i.e., by committed transactions.

Before  describing  the  algorithms  for  the  read  and  validation  phases,  we  summarize  the  information 
used  to  keep  track  of  the  dependencies  among  transactions: 

• for each active transaction T, its readset, RS(T), and writeset, WS(T);

• for each committed transaction T, a timestamp ts(T) assigned in its validation phase;

• for each active transaction T and for each data object x it has read or written in its read phase, an
interval of timestamps I(T, x); and

• for each data object x, RTS(x) and WTS(x), which denote the largest timestamps of the committed
transactions having read or written x, respectively.

In order to decide whether a transaction T is involved in a nonserializable execution, all the timestamp
intervals of T are grouped as

$$ I(T) = \bigcap_{x \in X} I(T, x) $$

where X is the set of data objects accessed by T. I(T) preserves the order between T and committed
transactions. Any operation of an active transaction T which introduces a nonserializable execution can be
detected by checking whether the execution of the operation results in I(T) = ∅.

In the implementation, each transaction T is associated with its current interval I_c(T) instead of the
individual I(T, x)'s and I(T). At the start of T, I_c(T) is initialized as [0, ∞) (the whole set of allowable
timestamps). For each read or prewrite made by T, I_c(T) is adjusted according to the dependencies induced
by the operation to satisfy (C1). A transaction T must be restarted when I_c(T) = ∅. The gradual construction
of a serialization order by using I_c(T) makes it possible to detect nonserializable executions even before
the transaction reaches its validation phase. Furthermore, every transaction that reaches its validation phase
is guaranteed to commit in this protocol.

We present the protocol via the following pseudo code. We bracket a critical section by "<" and ">"
and assume that timestamp intervals contain only integers.

Read phase

    < for every data object x in RS(T_i) do
          I_c(T_i) := I_c(T_i) ∩ [WTS(x)+1, ∞) >
      if I_c(T_i) = ∅ then restart(T_i)

    < for every data object x in WS(T_i) do
          I_c(T_i) := I_c(T_i) ∩ [WTS(x)+1, ∞) ∩ [RTS(x)+1, ∞) >
      if I_c(T_i) = ∅ then restart(T_i)

Validation and Write phase

    < choose ts(T_i) in I_c(T_i)
      update RTS(x) and WTS(x) for every x in RS(T_i) and WS(T_i)
      adjust I_c(T_j) >
    make its updates permanent in the database
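
The read and validation steps above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the names `Store`, `Transaction`, `Restart`, and `intersect` are hypothetical, an object with no committed reader or writer is assumed to carry timestamp -1, and the validating transaction is assumed to take the smallest timestamp in its interval (the protocol only requires some timestamp in the interval). The adjustment of other active transactions is left out here.

```python
import math

INF = math.inf  # stands for the open upper end of [t, ∞)


def intersect(a, b):
    """Intersect two integer intervals given as (lo, hi); None means empty."""
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


class Restart(Exception):
    """Raised when the current interval of a transaction becomes empty."""


class Transaction:
    def __init__(self, name):
        self.name = name
        self.interval = (0, INF)  # I_c(T) starts as the whole timestamp space
        self.readset, self.writeset = set(), set()


class Store:
    def __init__(self):
        self.rts = {}  # RTS(x): largest timestamp of a committed reader of x
        self.wts = {}  # WTS(x): largest timestamp of a committed writer of x

    def read(self, t, x):
        # A reader must be serialized after the last committed writer of x.
        t.readset.add(x)
        t.interval = intersect(t.interval, (self.wts.get(x, -1) + 1, INF))
        if t.interval is None:
            raise Restart(t.name)

    def prewrite(self, t, x):
        # A writer must follow both the last committed writer and reader of x.
        t.writeset.add(x)
        lo = max(self.wts.get(x, -1), self.rts.get(x, -1)) + 1
        t.interval = intersect(t.interval, (lo, INF))
        if t.interval is None:
            raise Restart(t.name)

    def validate(self, t):
        ts = t.interval[0]  # assumption: take the smallest admissible timestamp
        for x in t.readset:
            self.rts[x] = max(self.rts.get(x, -1), ts)
        for x in t.writeset:
            self.wts[x] = max(self.wts.get(x, -1), ts)
        return ts
```

Because every read and prewrite narrows the interval immediately, a transaction whose interval is still non-empty at validation can always be given a timestamp, which mirrors the guarantee stated above.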

The validation of a transaction means that the execution of the operations from the transaction is
serializable, and the execution should be reflected in the serialization order of committed transactions.
Thus we should choose a timestamp for the transaction to satisfy (C2), update RTS and WTS for the data
objects it accessed, if necessary, and adjust the timestamp intervals of all active transactions which
conflict with it to satisfy (C1). Any timestamp ts ∈ I_c(T_i) satisfies the condition (C2). The adjustment
procedure is as follows:

Interval Adjustment Operation

    < for every data object x in RS(T_i) do
          for every transaction T_j which has written x do
              I_c(T_j) := I_c(T_j) ∩ [ts(T_i)+1, ∞) >
              if I_c(T_j) = ∅ then restart(T_j)

    < for every data object x in WS(T_i) do
          for every transaction T_j which has read x do
              I_c(T_j) := I_c(T_j) ∩ [0, ts(T_i)-1]
          for every transaction T_j which has written x do
              I_c(T_j) := I_c(T_j) ∩ [ts(T_i)+1, ∞) >
              if I_c(T_j) = ∅ then restart(T_j)
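
As a rough illustration of the interval adjustment, the sketch below applies the three rules to a list of active transactions once a validating transaction has chosen its timestamp. The function name `adjust_active` and the dictionary layout are assumptions made for this example, not part of the protocol description.

```python
import math

INF = math.inf


def intersect(a, b):
    """Intersect two integer intervals given as (lo, hi); None means empty."""
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def adjust_active(ts_i, readset_i, writeset_i, active):
    """Adjust the intervals of active transactions after T_i validates.

    `active` is a list of dicts with keys name, interval, readset, writeset.
    Returns the names of the transactions that must be restarted.
    """
    restarted = []
    for t in active:
        iv = t["interval"]
        if t["writeset"] & readset_i:
            # T_j prewrote an object T_i read: T_j must follow T_i.
            iv = intersect(iv, (ts_i + 1, INF))
        if t["readset"] & writeset_i:
            # T_j read an object T_i wrote: T_j must precede T_i.
            iv = intersect(iv, (0, ts_i - 1))
        if t["writeset"] & writeset_i:
            # T_j prewrote an object T_i wrote: T_j must follow T_i.
            iv = intersect(iv, (ts_i + 1, INF))
        t["interval"] = iv
        if iv is None:
            restarted.append(t["name"])
    return restarted
```

For instance, with ts(T_i) = 5, a transaction that has read an object written by T_i and whose interval is [3, ∞) is narrowed to [3, 4], whereas one whose interval is [5, 9] is left with nothing and must restart.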

The adjustment procedure given above can be modified in several ways to integrate priority scheduling
with this protocol. As a simple approach, we can adjust the size of I_c(T_j) of an active transaction T_j.
Because the size of the interval is inversely related to the probability that the transaction is restarted,
for priority scheduling a transaction with higher priority needs to have a larger timestamp interval than a
transaction with lower priority. When adjusting the timestamp intervals of active transactions, if we give
larger timestamp intervals to transactions with higher priority than to transactions with lower priority,
then we can decrease the risk of restarting higher priority transactions. The choice of a timestamp for the
validating transaction also has a definite effect on the active transactions which conflict with it, because
the timestamp intervals of those transactions are adjusted according to the timestamp chosen.

As another approach, the priority wait strategy [Har90], in which the validating transaction waits for
the conflicting transactions with higher priority to complete, can also be used in this protocol. The
advantage of this strategy is that a higher priority transaction is not restarted due to the validation of a
lower priority transaction. While a lower priority transaction is waiting, it is possible that it will be
restarted due to the validation of one of the conflicting higher priority transactions.
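
The decision made by a validating transaction under the priority wait strategy can be sketched as below. The names `Txn` and `validation_decision` are hypothetical, and the sketch assumes that a larger number means a higher priority; it only shows the wait-or-validate test, not the blocking itself.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    name: str
    priority: int  # assumption: larger value means higher priority


def validation_decision(validating, conflicting):
    """Return ("wait", names) if higher priority conflicts exist, else ("validate", [])."""
    higher = [t.name for t in conflicting if t.priority > validating.priority]
    return ("wait", higher) if higher else ("validate", [])
```

A transaction of priority 2 that conflicts with active transactions of priority 5 and 1 would wait for the first one only; once no higher priority conflict remains, it proceeds to validation.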

3.  A  New  Parallel  Paradigm  for  Real-Time  Database  System 

One important advance in computing technology is the emergence of parallel computers. In a database
system, there are at least two levels at which parallelism can be exploited. The first level contains
the basic database operations. The basic idea behind parallel algorithms at this level is to partition a
single database operation into multiple sub-operations, perform those sub-operations simultaneously, and
then combine the separate results into one. For example, the join operation can be performed in parallel by
dividing one of the two relations into several blocks and joining each block with the other relation
simultaneously. As a large amount of data is usually involved in each database operation, it is essential
from a performance standpoint that the data be accessed efficiently. New techniques to organize indices and
to structure data files are needed.
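
The block-partitioned join mentioned above might look like the following sketch. It is an illustration only: relations are modeled as lists of dictionaries, `parallel_join` and `join_block` are hypothetical names, and a thread pool is used merely to show the decomposition (in CPython, threads will not by themselves speed up a CPU-bound join).

```python
from concurrent.futures import ThreadPoolExecutor


def join_block(block, other, key):
    """Nested-loop equi-join of one block against the whole other relation."""
    return [{**r, **s} for r in block for s in other if r[key] == s[key]]


def parallel_join(left, right, key, blocks=4):
    """Split `left` into blocks, join each block with `right`, merge the results."""
    size = max(1, -(-len(left) // blocks))  # ceiling division
    chunks = [left[i:i + size] for i in range(0, len(left), size)]
    with ThreadPoolExecutor(max_workers=blocks) as pool:
        # map() preserves the order of the chunks in its results.
        parts = list(pool.map(lambda b: join_block(b, right, key), chunks))
    return [row for part in parts for row in part]
```

Since each block is joined against the complete second relation, the concatenated partial results equal the result of the sequential join.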

The second level is the query processing level, at which different queries can be executed simultaneously
if they do not conflict. For example, two CREATE operations can be executed in parallel on different
processors, or the interpretation of two expressions can be done simultaneously. Here we are only
concerned with parallelism at the second level.

PRDB is an experimental real-time database system that runs on an emulated tightly-coupled,
shared-memory multiprocessor system in the StarLite software development environment, running on
UNIX under SunView/X Windows. The overall design goal of PRDB is to provide a general paradigm
for exploring parallelism and implementing different real-time scheduling policies in database systems.
The paradigm has evolved from the WorkCrew model [Rob89]. The major advantage of the WorkCrew
paradigm is its efficient mechanisms to control and manage parallelism: it creates the minimum number
of processes in the system and employs a lazy evaluation technique for posted work. The
synchronization of concurrent tasks and the overhead of task decomposition are minimized.

In the WorkCrew paradigm, tasks are assigned to a finite set of workers. A task may consist of
several subtasks. If some of the subtasks can be executed in parallel, they are put into a "request_help"
queue of the worker. Any idle worker can take over the subtasks and execute them. The WorkCrew
paradigm has two advantages. First, much of the work associated with task division can be deferred until a
new worker actually undertakes the subtask, and avoided altogether if the original worker ends up
executing the subtask serially. Second, the number of active workers in the system is always equal to the
number of processors.

However, the WorkCrew paradigm has two limitations that prevent it from becoming a general
framework for parallel computing. The first limitation is that there is no general mechanism to retrieve
results. In the WorkCrew model, the results of operations are reflected in preallocated space. If
operations produce new results apart from those stored in preallocated space, which is the case for most
applications, there is no way to retrieve those results. The second limitation is that
there is no way to specify different operations to be performed on data, i.e., the procedure to manipulate a
set of data cannot be explicitly passed to each worker so that the worker can perform different operations.
Further, the WorkCrew model does not address the real-time requirements of the application.

In our paradigm, the first limitation is addressed by providing a result queue for the crew. The
second limitation is dealt with by passing the handler for operations as a parameter to each worker. These
improvements require the extension of the concept of work. The concept of work in the WorkCrew
paradigm is a passive entity and consists only of the data items to be manipulated. In the PRDB paradigm,
work is still a passive entity; however, its contents consist not only of the data items to be
manipulated, but also of the operation to be performed on the data items and the timing-constraint
information for the work to be performed.

The real-time transaction scheduler and the CPU schedulers (called dispatchers) are separated. The
real-time transaction scheduler is implemented by the crew, while the dispatcher is implemented within
each worker. The real-time transaction scheduler schedules tasks according to its own policies and puts
them onto two work queues residing on the crew. One of these two queues is for hard deadline tasks and
the other is for soft deadline tasks. Since each worker also has its own "request_help" queue, the search
path of an idle worker looking for work begins with the hard deadline queue of the crew, then the
"request_help" queues of the workers, and finally the soft deadline queue of the crew. If the deadline has
passed, the worker immediately writes a result into the result queue indicating that the deadline was
missed. Otherwise, the work is performed and results are returned through the result queue. In the case where
a worker has to synchronize with other workers in performing a task, the worker blocks and a new worker
is created to help with the other workers’ work. Thus, the number of active workers is always equal to that
of the processors in the system if the work load is high.
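
The search path followed by an idle worker can be sketched as below. The class `Crew` and its attribute names are assumptions made for illustration; only the order of the queues (hard deadline queue, then the workers' "request_help" queues, then the soft deadline queue) is taken from the description above.

```python
from collections import deque


class Crew:
    def __init__(self, n_workers):
        self.hard = deque()  # hard deadline work queue of the crew
        self.soft = deque()  # soft deadline work queue of the crew
        # One "request_help" queue per worker.
        self.request_help = [deque() for _ in range(n_workers)]

    def next_work(self):
        """Return the next unit of work for an idle worker, or None if idle."""
        if self.hard:
            return self.hard.popleft()
        for queue in self.request_help:
            if queue:
                return queue.popleft()
        if self.soft:
            return self.soft.popleft()
        return None
```

Placing the "request_help" queues between the two crew queues means that subtasks of work already in progress are served before new soft deadline work is started.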

The  data  structures  of  a  unit  of  work  and  a  unit  of  result  are  as  follows: 

WORK = RECORD
    critical  : CARDINAL;   (* hard vs soft deadline *)
    deadline  : Time;       (* the deadline is checked before executing the operation *)
    operation : PROCEDURE;  (* specifying the operation *)
    paramAddr : ADDRESS;    (* pointer to the work to be done *)
    size      : CARDINAL;   (* the size of the work data structure *)
END;

RESULT = RECORD
    missDeadline : BOOLEAN;   (* missed deadline? *)
    finishTime   : Time;      (* the finished time of a unit of work *)
    resultAddr   : ADDRESS;   (* pointer to the result data structure *)
    size         : CARDINAL;  (* the size of the result data structure *)
END;
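
To make the role of these records concrete, the following Python sketch mirrors them with dataclasses and shows a worker checking the deadline before executing the operation. It is not the PRDB code: `Work`, `Result`, and `perform` are hypothetical names, a plain Python value stands in for the `paramAddr`/`size` and `resultAddr`/`size` pairs, and the clock is passed in as a parameter so the behavior is easy to test.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Work:
    critical: int                     # hard vs soft deadline
    deadline: float                   # checked before executing the operation
    operation: Callable[[Any], Any]   # the operation to perform
    param: Any                        # stands in for paramAddr and size


@dataclass
class Result:
    miss_deadline: bool               # missed deadline?
    finish_time: float                # time at which the unit of work finished
    result: Optional[Any]             # stands in for resultAddr and size


def perform(work, clock=time.monotonic):
    """Execute one unit of work unless its deadline has already passed."""
    start = clock()
    if start > work.deadline:
        # Report the miss immediately instead of doing useless work.
        return Result(True, start, None)
    value = work.operation(work.param)
    return Result(False, clock(), value)
```

Carrying the operation inside the work record is what lets one worker perform different operations, which is the second extension to the WorkCrew model described earlier.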

The  major  functions  provided  by  the  paradigm  are  starting  a  crew  of  workers,  destroying  a  crew  of 
workers,  modifying  the  number  of  workers  in  a  crew,  assigning  work  to  a  crew,  requesting  help  by  a 
worker,  testing  whether  the  requested  work  has  been  done  by  other  workers,  and  waiting  for  some  work  to 
be  finished. 

Each basic database operation is written by using the functions provided above if some part of the
basic database operation can be done in parallel. Initial results have indicated the soundness of the
paradigm for parallel real-time database computing. More thorough experiments are being carried out. We
believe that this new paradigm will scale well to a large number of processors in the system and will be
efficient in scheduling real-time transactions.

The data given below are the relative speedups of PRDB over the RDB system. The workload for
the experiments is the same for the uniprocessor which runs the RDB system and the multiprocessor
system which runs PRDB. The first experiment (Test1) consists of 26 "Create" operations and 22 "Insert"
operations. Each "Insert" operation inserts 15 tuples in a different relation with three attributes each.
The other experiments (Test2 and Test3) consist of the same operations as Test1; however, each "Insert"
operation in Test2 inserts 25 tuples, while each "Insert" operation in Test3 inserts 50 tuples. The results
show that PRDB favors coarse-grained parallelism in the computation.

[Table: execution times (in time units) and relative speedups of PRDB over RDB; the row and column layout
is not recoverable. Surviving values, in extraction order:]
Test3 Time Units: 26195, 3704, 1.24, 5761, 1.57, 14276, 1.83, 2593, 1.77, 3471, 2.60, 7813, 6752, 5841, 131, 131, 448


4.  Concluding  Remarks 

A real-time database manager is one of the critical components of a real-time system. To satisfy
timing requirements, transactions must be scheduled considering not only the consistency constraints but
also their timing constraints. In addition, the system should support predictable behavior, so that the
possibility of critical tasks missing their deadlines can be reported ahead of time, before those deadlines
expire. In this paper, we have presented new paradigms that exploit the ideas of dynamic adjustment of
serialization order and parallel computing. We are currently working on the performance evaluation of the
new paradigms using the prototyping environment as well as experimental database systems.

Abstract

Compared with traditional databases, database systems for time-critical applications have the distinct
feature that they must satisfy timing constraints associated with transactions. Transactions in real-time
database systems should be scheduled considering both data consistency and timing constraints. Since a
database system must operate in the context of available operating system services, an environment for
database systems development must provide facilities to support operating system functions and integrate
them with database systems for experimentation. We chose the ARTS real-time operating system kernel. In
this paper, we present our experience in integrating a relational database manager with a real-time
operating system kernel, and our attempts at providing flexible control for concurrent transaction
management. Current research issues involving the development of a programming interface and our efforts
in using these techniques in implementing a specific experimental application are also discussed.


1. Introduction

Real-time computing is an open research area [Stan88]. The growing importance of real-time computing
in a large number of applications, such as aerospace and defense systems, industrial automation and
robotics, and nuclear power plants, has resulted in an increased research effort in this area. Recent
workshops have pointed to the need for basic research in database systems that satisfy timing constraints
in collecting, updating, and retrieving shared data, since traditional data models and databases are not
adequate for real-time systems [IEEE90, ONR90]. Very few conventional database systems allow users to
specify timing constraints or ensure that the system meets those set by the user. Interest in this new
application domain is also growing in the database community. Recently, a number of research results have
appeared in the literature [Abb89, Buc89, Kor90, Lin90, Sha88, Sha91, Son88, Son89, Son90].

Time is the key factor to be considered in real-time database systems, and the correctness of the
system depends not only on the logical results but also on the time within which the results are produced.
Transactions must be scheduled in such a way that they can be completed before their corresponding
deadlines expire. For example, both the update and query on the tracking data for a missile must be
processed within given deadlines, satisfying not only database consistency constraints but also timing
constraints.

Conventional database systems are typically not used in real-time applications because of their poor
performance and lack of predictability. Current database systems do not schedule their transactions to
meet response requirements, and they commonly lock data tables to assure only the consistency of the
database. Locks and time-driven scheduling are basically incompatible, resulting in response requirement
failures when low priority transactions block higher priority transactions. New techniques are required to
manage the consistency of real-time databases, and they should be compatible with time-driven scheduling
and meet both the required system response predictability and temporal consistency.

To address the inadequacies of current database systems, the transaction scheduler needs to be able to
take advantage of the semantic and timing information associated with data objects and transactions. A
model of real-time transactions needs to be developed which characterizes distinctive features of real-time
databases that can contribute to the improved responsiveness of the system. The semantic information of
the transactions investigated in the modeling study can be used to develop efficient transaction schedulers
[Son90b, Son91].

A database system must operate in the context of available operating system services, because correct
functioning and timing behavior of database control algorithms depend on the services of the underlying
operating system. As pointed out by Stonebraker, operating system services in many systems are not
appropriate for support of database functions [Ston81]. In many areas, such as buffer management,
recovery, and consistency control, operating system facilities have to be duplicated by database systems
because they are too slow or inappropriate. An environment for database systems development must,
therefore, provide facilities to support operating system functions and integrate them with database
systems for experimentation.

The ARTS real-time operating system kernel, under development at Carnegie Mellon University,
attempts to provide a "predictable, analyzable, and reliable distributed real-time computing environment"
which is an excellent foundation for a real-time database system [Tok89]. The ARTS system, which
provides support for programs written in C and C++, implements different prioritized and non-prioritized
scheduling algorithms and prioritized message passing, and supports lightweight tasks. All of these
features are important when considering a real-time database.

Our research effort resulted in a new relational database manager for distributed real-time systems.
We have used relational database technology since it provides the most flexible means of accessing
distributed data. In this paper, we present our experience in integrating a relational database manager
with a real-time operating system kernel, and our attempts at providing flexible control for concurrent
transaction management using a technique called workload mediation. Current research issues involving the
development of a programming interface, associated issues of client/server object development which can be
simplified through the use of templates, and our efforts in using these techniques in implementing a
specific experimental application are also discussed.

2.  The  ARTS  Real-Time  OS  Kernel 

Research in the area of distributed, real-time operating systems indicates that most are designed for a
specific need, and as such are difficult to build, maintain, and modify; in addition, they do not afford the
capability of predicting runtime behavior during application design. In fact, few non-real-time operating
systems provide a functionally complete set of general purpose, real-time task and time management
functions, despite the fact that the user community is expressing the desire for increasingly complex
applications of this type. Since the success of applications in real-time computing is primarily contingent
on a system’s temporal functionality, what is needed is an environment in which the system engineer can
analyze and predict, during the design stage, whether the given real-time tasks having various types of
system and task interactions (e.g., memory allocation/deallocation, message communications, and I/O
interactions) can meet their timing requirements.

In an attempt to provide such functionality, ARTS provides the process and data encapsulation that
other distributed, object-oriented operating systems do, while at the same time including elements of
temporal significance in the services it provides. This integration of data, thread, and concurrency control
greatly facilitates real-time schedulability analysis. ARTS can support both hard and soft real-time tasks
as well as periodic and sporadic ones [Tok89].

To support time-critical operations, the ARTS programming language interface allows designers to
specify timing requirements and the chosen communication structure so that they are visible at both the
language and system level; this allows the system-wide ARTS environment to make scheduling decisions
based on both temporal constraints and priority. The Integrated Time-Driven Scheduler (ITDS) model of
ARTS is more effective than the common priority-based preemptive scheduling of many real-time systems.
Such simple schedulers become confused during heavy system loads when they cannot decide which
tasks are important and should be completed and which tasks should be aborted, causing unpredictability in
the applications. The ITDS model, however, employs a time-varying "value function" which specifies both a
task’s time criticality and semantic importance simultaneously. A hard real-time task can be characterized
by a step function where the discontinuity occurs at the deadline, while a soft real-time task is described
by a continuous (linear or non-linear) function that decreases after its critical time. In addition, ARTS’
designers have separated the policy and mechanism layers, so that users can implement new scheduling
policies with a minimum of effort, even dynamically changing the policy during runtime.
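
The two shapes of value function described above can be illustrated with a small sketch. The function names, the default value of 1.0, the choice of zero as the value after a hard deadline, and the linear decay rate are all assumptions for this example; they are not taken from the ARTS interface.

```python
def hard_value(t, deadline, value=1.0):
    """Step function: full value up to the deadline, none afterwards."""
    return value if t <= deadline else 0.0


def soft_value(t, critical_time, value=1.0, decay=0.1):
    """Full value until the critical time, then a linear decrease to zero."""
    if t <= critical_time:
        return value
    return max(0.0, value - decay * (t - critical_time))
```

A scheduler under overload could compare such values at the projected completion times of competing tasks to decide which task to complete and which to abort.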

The issue of priority inversion is crucial to providing semantically correct system behavior in addition
to addressing temporal concerns. Priority inversion occurs when a high priority activity waits for a lower
priority activity to complete. Resource sharing and communication among the executing tasks can lead to
priority inversion if the operating system does not manage the available resource set properly. Significant
research in the construction of ARTS was done to avoid priority inversion among concurrently executing
tasks. In the processor scheduling domain, low priority servers which provide service to clients of all
priorities are susceptible to inversion. For example, when a low priority request is being serviced, a high
priority task requests the same service; since the server’s computation is non-preemptable, the high
priority request waits. Any task of higher priority than the server may preempt the server itself, however,
so if a medium priority task arrives it preempts the server indefinitely, causing the high priority job to
be lost in the shuffle. ARTS employs a priority inheritance mechanism to propagate information about a
single computation which crosses task boundaries. That is, if a server task accepts the request of a client,
the server inherits the priority of the client. Furthermore, the server should also inherit the priority of
the highest priority task waiting for the service.
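
The inheritance rule can be stated compactly in code. This is a minimal sketch under stated assumptions: the function name `server_priority` is hypothetical, larger numbers mean higher priority, and the server is assumed never to drop below its own base priority.

```python
def server_priority(base, serving=None, waiting=()):
    """Effective priority of a server under priority inheritance.

    base:    the server's own priority
    serving: priority of the client whose request is being serviced, if any
    waiting: priorities of the clients waiting for the service
    """
    candidates = [base]
    if serving is not None:
        candidates.append(serving)
    candidates.extend(waiting)
    return max(candidates)
```

In the scenario above, a server of priority 1 that is servicing a priority 2 request while a priority 9 task waits would run at priority 9, so a medium priority task of priority 5 can no longer preempt it.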

The notion of time encapsulation cannot be divorced from the basic structure of ARTS, in which
every computational entity is represented as an object, called an artobject. An artobject is defined as
either a passive or an active object. In a passive object, there is no explicit declaration of a thread
which accepts incoming invocation requests, while an active object contains one or more threads defined by
the user. In an active object, the designer is responsible for providing concurrency control among
coexecuting operations. When a new instance of an active object is created, its root thread is created and
run immediately. A thread can create threads within its object.

The ARTS kernel supports the notion of real-time objects and real-time threads. A real-time object
is defined with a "time fence," a timer associated with the thread which ensures that the remaining slack
time is larger than the worst case execution time for the operation. A real-time thread can have a value
function and timing constraints related to its execution period, worst case execution time, phase, and delay
value. When an operation with a time fence is invoked, the operation will be executed (or accepted) if the
caller has enough remaining computation time to cover the specified worst case execution time of the
operation. Otherwise, it will be aborted with a time fence error. The objective of this extension to a
normal object paradigm is to prevent timing errors from crossing task or module boundaries (as often
happens in traditional real-time systems which use a cyclic executive) and to bind the timing error at
every object invocation.
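
The time fence check can be pictured as a guard placed in front of an operation. The sketch below is an illustration, not the ARTS mechanism: `TimeFenceError` and `invoke` are hypothetical names, times are plain numbers, and an operation whose worst case execution time exactly equals the remaining slack is assumed to be accepted.

```python
class TimeFenceError(Exception):
    """Raised when the caller's remaining time cannot cover the operation."""


def invoke(operation, wcet, deadline, now):
    """Run `operation` only if the remaining slack covers its worst case time."""
    slack = deadline - now
    if slack < wcet:
        raise TimeFenceError(f"slack {slack} is less than worst case time {wcet}")
    return operation()
```

Rejecting the call at the invocation boundary keeps the timing error with the caller instead of letting it surface later inside the invoked object.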

On top of the ARTS foundation we have built a relational database manager using message passing
primitives and employing the client/server paradigm. The result, RTDB, currently consists of a
multi-threaded server which accepts requests from several clients. Based on the temporal urgency of the
request, the server determines whether it can commit the transaction or has to reject it.


3.  Comparison  with  Existing  Systems 

One of the principal goals of the ARTS project is to provide a more easily extensible real-time
environment than is currently enjoyed by programmers developing on other kernels. To that end, ARTS
requires better data management facilities than many other kernels offer. The RTDB on ARTS represents a
combination of desirable aspects of database technology and development flexibility. In comparing the
RTDB with other existing systems, we note some differences between it and both research and commercial
products. For example, the CASE-DB is developed as a single-user, disk-based, real-time relational DBMS,
which uses the relational algebra as its query language [Ozso90]. RTDB diverges from this design
philosophy in many ways, being a multi-user, distributed real-time DBMS.

Supported media types also differ among real-time environments. The HP-RTDB, one of Hewlett
Packard’s Industrial Precision Tools, provides software application developers with a tool to structure and
access memory-resident data [Fate89]. Essentially, HP-RTDB is a library of routines used to define and
manipulate a database schema, build the database in memory, load and unload it, and write or read
data to and from it. It also provides mechanisms for archiving schema and data, and for storing timestamp
information. The ARTS-RTDB provides a three-tiered approach to supported media types, offering
memory-resident data options, RAM-based disk storage, and access to the UNIX file system for disk storage.
Each medium has its own advantages and drawbacks in terms of compatibility, performance, and
recoverability. Naturally, access speed decreases along this continuum. This support of various media types
gives developers the flexibility to choose those that best suit their needs. Also, we provide the ability
to cross the boundaries between these media, and to utilize several media types in an individual query for
both the source and resultant relations.

4.  The  RTDB  Real-Time  Database  Manager 

The RTDB is a relational database manager written
in C and designed to run on ARTS. It offers not only a
functionally complete set of relational operators (such
as join, projection, selection, union, and set
difference) but also other necessary operators such as
create, insert, update, delete, rename, compress, sort,
extract, import, export, and print. These operators give
the user a good amount of relational power and convenience
in managing the database.

We have developed two different kinds of clients
for the RTDB. One is an interactive command
parser/request generator that makes requests to the
server on behalf of the user. This client looks and
behaves similarly to a single-user database manager. It
is possible to run the client without knowing that any
interaction between server and client is occurring. The
other client is a transaction-generating "batch" client,
representing a real-time process that needs to make
database access requests.

Figure 1. Mediator object model

The RTDB server object is the heart of the database
management system. It is responsible for creating
and storing the relations, receiving and acting on
requests from multiple clients, and returning desired
information to the clients.

The server object defines three threads. The root
thread is automatically executed by ARTS upon invocation
of the server. The server activates one or more
worker threads, and activates a backup thread which is
responsible for periodically backing up the relations
that reside only in main memory.

The root thread of the server is responsible for
binding the server's name in the ARTS name server so
that the clients can find it and send requests. It is also
responsible for reading the relations into memory, initializing
the lock table, initializing the blocked request
queue, instantiating the backup thread, and instantiating
the server worker threads. There is at least one server
worker created for each thread priority. After completing
these tasks, the root thread enters an infinite loop
that accepts database requests from any client. The
requests come in as packets. The RTDB provides two
different types of packets: call packets and return packets.
A call packet, created by a client, contains all the
information that the server needs to carry out the
desired database access operation. Since different commands
require different information, the call packet has
a variant field containing different information for each
command. When the server completes the processing
of a request, it returns a packet to the client with the
information requested. This packet is called a return
packet. The return packet is created by the server and
also has a variant field that carries command-specific
information.

The communication between the server and
clients is performed by the ARTS communication
primitives: Request, Accept, and Reply. The communication
is synchronous: when a client issues a Request,
it is blocked until the server Accepts and Replies to the
message. This may cause some problems, especially in
a real-time environment, for two reasons: priority
inversion and data sharing.
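
To make the call packet and return packet idea concrete, the following is a minimal sketch in Python rather than the C of the actual RTDB. The text does not give the packet layout, so every name here (CallPacket, ReturnPacket, InsertArgs, PrintArgs, serve) is hypothetical; the only point illustrated is a packet whose command-specific "variant field" differs per command.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical command payloads; the real RTDB packet layout is not given in the text.
@dataclass
class InsertArgs:
    relation: str
    values: tuple


@dataclass
class PrintArgs:
    relation: str


@dataclass
class CallPacket:
    client_id: int
    priority: int                        # assumed range 0..7 (eight message priorities)
    args: Union[InsertArgs, PrintArgs]   # the command-specific variant field


@dataclass
class ReturnPacket:
    status: str
    payload: object                      # command-specific result


def serve(packet: CallPacket, db: dict) -> ReturnPacket:
    """Carry out one call packet against an in-memory dict of relations."""
    args = packet.args
    if isinstance(args, InsertArgs):
        db.setdefault(args.relation, []).append(args.values)
        return ReturnPacket("ok", 1)     # number of tuples inserted
    if isinstance(args, PrintArgs):
        return ReturnPacket("ok", list(db.get(args.relation, [])))
    return ReturnPacket("error", None)
```

In this sketch the client blocks simply because serve is an ordinary function call, which loosely mirrors the synchronous Request/Accept/Reply exchange described above.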

The ARTS kernel (and thus the RTDB system)
supports eight message priorities. When the root thread
Accepts a message, it extracts priority information
from the message packet. The root thread then
enqueues the request on the message queue (i.e., the pending
request queue) of the worker thread designated to
service requests of that priority level. If inactive, the
server worker will be polling its queue; if active, the
requests will be processed in FIFO order. Note that in
this way we can easily exploit the scheduling merits of
the underlying ARTS kernel without circumventing its
priority-based scheduling mechanisms. Since the
worker thread's priority matches that of the messages it
services, it will only be scheduled for the CPU in an
interval where its priority is currently the highest in the
system. This describes the general case. For those instances
where the scheduling technique is not priority based, or
the ARTS priority inheritance mechanism is employed,
these decisions will naturally be reflected in the workers.

This technique of distributing requests among a
pool of workers based on information contained in the
request packet is called workload mediation. It is
intrinsic to implementing various algorithms which
deal with semantic information provided by the clients
and/or the task requests (i.e., temporal issues, priorities,
etc.). Determining the proper balance of control
between ARTS primitives and RTDB explicit mediation
will help us achieve the most beneficial symbiosis
of the system's resources, which is one of the goals of
our research. Figure 1 illustrates the mediator mechanism
incorporated within the server object.
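
The mediation step described above can be sketched as follows. This is an illustrative Python model, not the RTDB's C implementation: the Mediator class and its method names are assumptions, and the sketch takes priorities to be integers 0 through 7 without assuming which end is the most urgent.

```python
from collections import deque

NUM_PRIORITIES = 8  # the text states that ARTS supports eight message priorities


class Mediator:
    """Root-thread role: route each accepted request to the queue of the
    worker that services that priority level."""

    def __init__(self):
        # One FIFO pending-request queue per priority level.
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]

    def accept(self, priority, request):
        if not 0 <= priority < NUM_PRIORITIES:
            raise ValueError("priority out of range")
        self.queues[priority].append(request)

    def next_for_worker(self, priority):
        """Worker role: poll the queue for this priority; None if it is empty."""
        queue = self.queues[priority]
        return queue.popleft() if queue else None
```

Because each worker is assumed to run at the priority of the queue it drains, the order in which workers get the CPU is left to the kernel scheduler; the mediator itself only preserves FIFO order within a priority level.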

The worker thread of the RTDB server performs
the client's request to access the database. It checks its
request message queue, carries out the work that is
requested, and replies to the client. The worker
Replies to a client without completing a request when it
needs to return more information than can fit in a single
packet. In such a case, the client must make continuation
requests to the server until it gets all the information
requested.
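
A continuation exchange of this kind might look like the sketch below. The packet capacity, the per-client cursor, and the names ContinuationWorker and fetch_all are all assumptions made for illustration; the text does not say how the RTDB tracks a partially answered request.

```python
PACKET_CAPACITY = 3  # hypothetical number of rows that fit in one return packet


class ContinuationWorker:
    def __init__(self, rows):
        self.rows = rows
        self.cursors = {}  # client id -> index of the next row to send

    def reply(self, client_id):
        """Return (chunk, done); done is False when more rows remain."""
        start = self.cursors.get(client_id, 0)
        chunk = self.rows[start:start + PACKET_CAPACITY]
        end = start + len(chunk)
        done = end >= len(self.rows)
        if done:
            self.cursors.pop(client_id, None)   # request fully answered
        else:
            self.cursors[client_id] = end       # remember where to resume
        return chunk, done


def fetch_all(worker, client_id):
    """Client side: keep issuing continuation requests until done."""
    result = []
    while True:
        chunk, done = worker.reply(client_id)
        result.extend(chunk)
        if done:
            return result
```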

To maintain the consistency of the database, the
RTDB server needs to handle conflicting requests properly.
For example, a problem occurs when some
request or part of a request (as in a multi-relational
query) has to be blocked since it needs to lock a relation
that is already locked. Our solution to this problem
is to use a lock table that keeps track of which relations
are in use at any given time. If a request for file A
comes in while file A is being used by another active
worker, then the new request must be put on an internal
queue until A and any other files it needs are available.


Whenever  the  worker  becomes  free,  it  first 
checks  its  queue  of  blocked  requests.  If  there  are  any 
requests  in  the  block  queue  that  can  be  unblocked,  it 
dequeues  the  request  and  processes  it.  If  no  request  in 
the  block  queue  is  ready  to  be  processed,  the  worker 
looks  to  its  incoming  request  queue. 
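
The lock table and blocked-request queue described above can be sketched as follows. This is a simplified, single-threaded Python model under stated assumptions: requests are plain dictionaries with a "relations" set, locking is at whole-relation granularity, and the names LockingWorker, pick_next, and finish are hypothetical.

```python
from collections import deque


class LockingWorker:
    def __init__(self, lock_table):
        self.lock_table = lock_table  # shared set of relation names in use
        self.blocked = deque()        # requests waiting for locked relations
        self.incoming = deque()       # newly arrived requests

    def _runnable(self, request):
        # Runnable only if none of the needed relations is currently locked.
        return not (request["relations"] & self.lock_table)

    def _start(self, request):
        self.lock_table.update(request["relations"])
        return request

    def pick_next(self):
        # First check the blocked queue for a request that can be unblocked.
        for request in list(self.blocked):
            if self._runnable(request):
                self.blocked.remove(request)
                return self._start(request)
        # Otherwise look to the incoming queue, parking conflicting requests.
        while self.incoming:
            request = self.incoming.popleft()
            if self._runnable(request):
                return self._start(request)
            self.blocked.append(request)
        return None

    def finish(self, request):
        # Release every relation the request held.
        self.lock_table.difference_update(request["relations"])
```

A real multi-threaded server would also need to protect the shared lock table itself against concurrent updates, which this sketch omits.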

5.  Programming  Interfaces 

Conventional database systems often provide
some interface through which they export functionality
to application developers. Such programming interfaces
simplify storage and retrieval tasks and provide a
scheme for the creation, manipulation and destruction
of database files. For systems utilizing the client-server
paradigm, communication primitives can also be
accessed through such an interface, achieving further
hiding of the implementation details.

To facilitate the construction of application
clients, we seek to provide a programming interface for
the database command set which hides the implementation
details of the system as much as possible. In this
way, developers who are more familiar with function-call
interfaces (e.g. SQL) can quickly adjust to the task
of constructing custom application clients rather than
application programs. Programming interfaces in real-time
databases differ greatly in terms of application-developer
friendliness. Some DBMS interfaces are
tightly coupled to theoretical techniques such as the
relational algebra. CASE-DB [Ozso90] is an example
of this type of interface. While this interface satisfies
the desired functionality requirements for a database, it
can be awkward to use when developing large, complex
applications. For these applications, it is more
appropriate to use an interface similar to those already
in use in non-real-time systems. These application program
interfaces consist of library functions.

In designing the programming interface for
RTDB, in addition to providing routines as in other
relational databases, we seek to hide the details of
ARTS' Request/Accept/Reply message passing
sequence. This allows interaction between client and
server to appear as if the application client were the
only one interacting with the server. This goal is only
partially attainable, since the physical code provided by
the application developer must coexist in the same
source code file as code which specifies constants and
declarations necessary to construct the complete client
image. To expedite the development process, we provide
a thoroughly commented, standardized client template
with which developers need only combine their
source and compile.

6.  DOSE:  An  Application  of  the  RTDB 

One of the applications of the RTDB is the Distributed
Operating System Experiment (DOSE),
presented in [Butt90]. The goal of DOSE is to evaluate
the feasibility of using a database kernel in embedded
systems with requirements for high performance and
real-time priority and predictability guarantees.

The DOSE application consists of data input,
storage, display, and retrieval functions. These functions
are implemented by four components: parser
manager (PM), track report manager (TRM), graphics
map client (GMC), and database monitor client (DMC).
Figure 2 illustrates the information flow among objects
in the DOSE experiment.

The PM receives tracking data from data terminals
or communication links and converts them into a
useful format such as floating point and signed integer
numbers. The PM does not retain any incoming or outgoing
information. The parsed data coming out from
the PM are stored by the TRM. For each new item of incoming
tracking data, a new data object is created. For high
reliability, TRM maintains replicated data objects. The
GMC enables the data to be mapped out and visualized
on screen. It periodically checks with the TRM for the
latest updates to be displayed. The DMC monitors the
data objects in each replicated TRM database. Using
frequent updates, it ensures that data remain
consistent across the replicated TRM databases.
Without DMC, the survivability and consistency of the
system would be weakened.

The scenario used with the DOSE application is
an outer air battle scenario generated by IBGTT, the
Interim Battle Group Tactical Trainer. The data generated
by IBGTT consist of coordinate and motion
data as well as general military classifications of
tracked objects, called platforms. These data can be
used to plot tactical information for a variety of situations,
including personnel training programs, strategic
simulations, and real-time military surveillance. Table
1 shows the attributes of data objects used in the DOSE
application.

Figure 2. Information flow among objects in the DOSE experiment

Attribute        Description
trk_num          track number
lat_track        latitude of track
long_track       longitude of track
bearing          bearing from data link reference point
dep_high         depth or height of platform
lat_dlrp         latitude of data link reference point
long_dlrp        longitude of data link reference point
platform_type    type of platform
cat              category of platform
time             greenwich mean time
trkqa            confidence of measurements
lat_tdir         latitude direction
long_tdir        longitude direction
coarse           bearing minus data link reference point
speed            speed of platform
range            range from data link reference point in nautical miles
nuclear          nuclear classification of platform

Table 1. Data object attributes in the DOSE application

Since some of the fields in Table 1 are basically used
as categorical designators, or flags, they can be used in
simple boolean subqueries (e.g., "where clauses" in the
RTDB syntax). For example, an "H" value for attribute
cat indicates a hostile platform; an "F", a friendly platform.
A "Y" value for attribute nuclear indicates a
confirmed nuclear platform; an "N" value is a confirmed
non-nuclear platform, and "U" is unknown. For example,
a query which seeks to determine all the attributes
of the friendly nuclear platforms may look as follows:

RTDB => print * from trackfile
where cat = "F" and nuclear = "Y";

A query to display information on all platforms
in a certain track might look like this:

RTDB ==> print lattrk, longtrk, bearing,
nuclear from trktable1 where trk = 4741;

In  implementing  the  TRM  and  DMC  using  the 
RTDB,  the  original  DOSE  tracking  data  has  been 
decomposed  into  several  track  files  of  similar  data.  All 
commands  currently  supported  by  the  RTDB  have  been 
tested  on  the  trackfile  data. 

Modifications to earlier versions of RTDB have
been made to support attribute type REAL, which is
identical to the floating point type of the DOSE application,
and to support aggregate operators such as SUM,
COUNT, MIN, MAX, and AVG. Although not
specifically delineated in the queries proposed in the
DOSE application, the addition of aggregates seems
important for the type of queries expected for the application.
For example, the system may be called on to report
a COUNT of all hostile, nuclear, air platforms, or the
bearing and speed of the hostile platform with MIN
range.
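
The two aggregate queries mentioned above can be illustrated with a small Python sketch. It does not use RTDB syntax or code: the rows are made-up sample values, the attribute names and flag values ("H", "F", "Y", "N") follow the descriptions in this section, and the helper names count and row_with_min are hypothetical. The "air" restriction is omitted because the text does not give the platform type values.

```python
# Made-up sample rows; values are illustrative only.
tracks = [
    {"trk_num": 4741, "cat": "H", "nuclear": "Y", "bearing": 45.0, "speed": 420.0, "range": 80.0},
    {"trk_num": 4742, "cat": "H", "nuclear": "N", "bearing": 90.0, "speed": 300.0, "range": 35.5},
    {"trk_num": 4743, "cat": "F", "nuclear": "Y", "bearing": 180.0, "speed": 25.0, "range": 12.0},
    {"trk_num": 4744, "cat": "H", "nuclear": "Y", "bearing": 270.0, "speed": 510.0, "range": 150.0},
]


def count(rows, predicate):
    """COUNT of the rows satisfying a where-clause-like predicate."""
    return sum(1 for row in rows if predicate(row))


def row_with_min(rows, attribute, predicate):
    """Row with the MIN value of attribute among matching rows, or None."""
    candidates = [row for row in rows if predicate(row)]
    return min(candidates, key=lambda row: row[attribute]) if candidates else None


# COUNT of hostile, nuclear platforms.
hostile_nuclear = count(tracks, lambda r: r["cat"] == "H" and r["nuclear"] == "Y")

# Bearing and speed of the hostile platform with MIN range.
nearest_hostile = row_with_min(tracks, "range", lambda r: r["cat"] == "H")
```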

Temporal database components are being investigated
for inclusion in the RTDB for the DOSE application.
They will address the desired timestamping of surveillance
updates generated by radar, sonar, or similar
equipment, and temporal consistency requirements of
real-time transactions. Other potential improvements
in efficient implementation are being examined to
determine their overall value to the RTDB system.
Indices and views are two of them. Since such features
alter not only the speed and predictability of the system
but also the basic file structure, they need to be examined
closely on their own, and then as new elements
within the existing system.

7.  Conclusions 

A real-time database manager is one of the critical
components of real-time systems, in which tasks are
associated with deadlines and a significant portion of
data is highly perishable in the sense that it has value to
the mission only if used quickly. To satisfy the timing
requirements, transactions must be scheduled considering
not only the consistency constraints but also their
timing constraints. In addition, the system should support
predictable behavior, so that the possibility of
missing deadlines of critical tasks can be reported
ahead of time, before their deadlines expire.

In this paper, we have presented an experimental
relational database manager developed for distributed
real-time systems. Since the characteristics of a real-time
database manager are distinct from those of conventional
database managers, there are different kinds of issues
to be considered in developing a real-time database
manager. For example, priority-based scheduling and
memory-resident data have been investigated in the
development of the RTDB.




The foundation now exists for a real-time relational
database manager. However, as with any active
research project, there are many technical issues associated
with real-time database systems that need further
investigation. It is our goal to facilitate further development
in this area, as well as with our RTDB. To that
end, we have discussed our work toward providing a
flexible programming interface and standard client template
to allow quick prototyping and faster modeling.
The RTDB described in this paper, with its multi-threaded
server model, is an appropriate research vehicle
for investigating new techniques and scheduling
algorithms for distributed real-time database systems.

ABSTRACT


Real-time database systems must maintain consistency while minimizing the number of transactions
that miss their deadlines. To satisfy both the consistency and real-time constraints, synchronization
protocols need to be integrated with real-time priority scheduling protocols. One of the reasons for
the difficulty in developing and evaluating database synchronization techniques is that it takes a long time
to develop a system, and evaluation is complicated because it involves a large number of system parameters
that may change dynamically. This paper describes an environment for investigating distributed real-time
database systems. The environment is based on a concurrent programming kernel which supports the
creation, blocking, and termination of processes, as well as scheduling and inter-process communication.
The contribution of the paper is the introduction of a new approach to system development that utilizes a
module library of reusable components to satisfy three major goals: modularity, flexibility, and extensibility.
In addition, experiments with real-time concurrency control techniques are presented to illustrate the
effectiveness of the environment.

Index Terms - distributed database, prototyping, synchronization, transaction, real-time

1. Introduction


In this paper, we report our experiences with a new approach to integrated development and evaluation
of real-time distributed database systems, and present experimental results of various real-time
synchronization techniques. The goal of the project is to test the hypothesis that a host environment can be
used to significantly accelerate the rate at which we can perform experiments in the areas of operating
systems, databases, and network protocols for real-time systems. A tool for developing components of
real-time distributed systems and integrating them to evaluate design alternatives is essential for the
advance of real-time computing technology. To the best of our knowledge, this is the first successful
attempt to develop such a tool as an environment consisting of a hybrid of actual implementation and
simulation.

As computers are becoming an essential part of real-time systems, real-time computing is emerging as
an important discipline in computer science and engineering [Shin87]. The growing importance of real-time
computing in a large number of applications, such as aerospace and defense systems, industrial automation,
and nuclear reactor control, has resulted in an increased research effort in this area. Researchers
working on developing real-time systems based on distributed system architecture have found that
database managers are assuming much greater importance in real-time systems. In recent workshops,
developers of "real" real-time systems pointed to the need for basic research in database systems that
satisfy timing constraint requirements in collecting, updating, and retrieving shared data [IEEE90,
ONR90]. Further evidence of its importance is the recent growth of research in this field and the
announcements by some vendors of database products that include features achieving high availability
and predictability [Son88].

In addition to providing relational access capabilities, distributed real-time database systems offer a
means of loosely coupling software processes, thereby making it easier to rapidly update software, at
least from a functional perspective. However, with respect to time-driven scheduling and system timing
predictability, they present new problems. One of the characteristics of current database managers is that
they do not schedule their transactions to meet response requirements and they commonly lock data tables
indiscriminately to assure database consistency. Locks and time-driven scheduling are basically incompatible.
Low priority transactions can and will block higher priority transactions, leading to response requirement
failures. New techniques are required to manage database consistency which are compatible with
time-driven scheduling and the essential system response predictability/analyzability it brings. One of the
primary reasons for the difficulty in successfully developing and evaluating new database techniques is
that it takes a long time to develop a system, and evaluation is complicated because it involves a large
number of system parameters that may change dynamically.

A prototyping technique can be applied effectively to the evaluation of database techniques for distributed
real-time systems. In this paper, we report our experiences with a new database prototyping
environment. It is constructed to support research in distributed database and operating system technology
for real-time applications. A database prototyping environment is a software package that supports
the investigation of the properties of database techniques in an environment other than that of the target
database system. The advantages of an environment that provides prototyping capability are obvious.
First, it is cost effective. If experiments for a twenty-node distributed database system can be executed in
a software environment, it is not necessary to purchase a twenty-node distributed system, reducing the
cost of evaluating design alternatives. Second, design alternatives can be evaluated in a uniform environment
with the same system parameters, making a fair comparison. Finally, as technology changes, the
environment need only be updated to provide researchers with the ability to perform new experiments.

A prototyping environment can reduce the time of evaluating new technologies and design alternatives.
From our past experience, we assume that a relatively small portion of a typical database system's
code is affected by changes in specific control mechanisms, while the majority of code deals with intrinsic
problems, such as file management. Thus, by properly isolating technology-dependent portions of a
database system using modular programming techniques, we can implement and evaluate design alternatives
very rapidly. In addition, a prototyping environment provides a friendlier development environment
than a target hardware system. The bare machine environment is the worst possible place in which to
explore new software concepts. For example, even the recovery of the event history leading up to an
error in a distributed system can be a difficult and, in some cases, an impossible task. Debugging is
greatly facilitated in a prototyping environment. The symbolic debugger of our environment supports the
examination of an arbitrary number of execution threads. As a result, the state of a distributed computation
can be examined as a whole.

Although there exist tools for system development and analysis, few prototyping tools exist for distributed
database experimentation, especially for distributed real-time database systems. Recently, simulators
have been developed for investigating performance of several concurrency control algorithms for
real-time applications [Abb88, Abb89, Raj89]. However, they do not provide a module hierarchy composed
from reusable components as in our prototyping environment. Software developed in our prototyping
environment will execute in a given target machine without modification of any layer except the
hardware interface. In addition, since our environment is a hybrid of prototyping and simulation (i.e.,
partially implemented and partially simulated), we can capture important timing features of the system,
which is very hard to do using simulation alone.

A database system must operate in the context of available operating system services. In other
words, database operations need to be coherent with the operating system, because correct functioning
and timing behavior of database control algorithms depend on the services of the underlying operating
system. Without control over the operating system, investigating the timing behavior of a database
system does not provide much information. An environment for database systems development must,
therefore, provide facilities to support operating system functions and integrate them with database systems
for experimentation.

Another important use of a prototyping environment is to analyze the reliability of database control
mechanisms and techniques. Since distributed systems are expected to work correctly under various
failure situations, the behavior of distributed database systems in degraded circumstances needs to be well
understood. Although new approaches for synchronization and checkpointing for distributed databases
have been developed recently [Liu87, Kor90, Lin90, Son89, Son90], experimentation to verify their properties
and to evaluate their performance has not been performed due to the lack of appropriate test tools.

When developing a database system, functional completeness and performance of the system are of
primary concern. The resulting systems are often not layered or modular in their implementation. However,
for experimentation, a layered implementation approach facilitates the rapid evaluation of new techniques.
Such a facility significantly improves the capability of the system designer in comparing design
alternatives in a uniform environment. In this regard, the concept of developing a methodology for layered
implementation of the system and building a library of modules with different
performance/reliability characteristics for operating system and database system functions seems promising.
The prototyping environment we have developed follows this approach [Cook87, Son88b].

The rest of the paper is organized as follows. Section 2 presents an informal description of a
message-based simulation. Section 3 describes the design principles and the current implementation of
the prototyping environment. Section 4 presents experiments with priority-based synchronization algorithms
and multiversion data objects using the prototyping environment. Section 5 concludes the paper.

2.  Message-Based  Simulation 

When prototyping distributed database systems, there are two possible approaches: sequential programming
and distributed programming based on message-passing. Message-based simulations, in which
events are message communications, do not provide additional expressive power over standard simulation
languages; message-passing can be simulated in many discrete-event simulation languages including
SIMSCRIPT [Kiv69] and GPSS [Sch74]. However, a message-based simulation can be used as an effective
tool for developing a distributed system because the simulation "looks" like a distributed program,
while a simulation program written in a traditional simulation language is inherently a sequential program.
Furthermore, if a simulation program is developed in a systematic way such that the principles of
modularity and information hiding are observed, most of the simulation code can be used in the actual
system, resulting in a reduced cost for system development and evaluation.

To prototype a distributed database system on a single host machine, it is necessary to provide virtual
machines for each node of the system being simulated. For that, the process view of a system has
been adopted. A distributed system being simulated consists of a number of processes which interact
with others at discrete instants of time. Processes are basic building blocks of a simulation program. A
process is an independent, dynamic entity which manipulates resources to achieve its objectives. A
resource is a passive object and may be represented by a simple variable or a complex data structure. A
simulation program models the dynamic behavior of processes, resources, and their interactions as they
evolve in time. Each physical operation of the system is simulated by a process, and the process interactions
are called events.

In  the  literature,  the  notion  of  a  process  has  been  given  numerous  definitions.  The  definition  used  in 
our model is much the same as that given in [Bri78]: A process is the execution of an interruptible
sequential  program  and  represents  the  unit  of  resource  allocation,  such  as  the  allocation  of  CPU  time, 
main  memory  and  I/O  devices. 

We  use  the  client/server  paradigm  for  process  interaction  in  the  prototyping  environment.  The  sys¬ 
tem  consists  of  a  set  of  clients  and  servers,  which  are  processes  that  cooperate  for  the  purpose  of  transac¬ 
tion  processing.  Each  server  provides  a  service  to  its  clients,  where  a  client  can  request  a  service  by  send¬ 
ing  a  request  message  (a  message  of  type  request)  to  the  corresponding  server.  The  computation  structure 
of  the  system  to  be  modeled  can  be  characterized  by  the  way  clients  and  servers  are  mapped  into 
processes.  For  example,  a  server  might  consist  of  a  fixed  number  of  processes,  each  of  which  may  execute 
requests from every transaction, or it might consist of a varying number of processes, each of which executes
on behalf of exactly one transaction.

Internal actions of a process, i.e., actions that do not involve interactions with other processes in the
system, are modeled either by the passage of simulation time or by the execution of sequential statements
within the process. We use a simulator clock to represent the passage of time in a simulation. The simulator
clock advances in discrete steps, where each step simulates the passage of time between two events in
the system.


In a physical system, each process makes independent progress in time as long as the resources it needs
are available, and many processes execute in parallel. In its simulation, the multiple processes of a physical
system must be executed simultaneously on one processor. This simultaneity is achieved in the prototyping
environment by supporting the simultaneous execution of multiple processes in a single address
space.

A  message-based  prototyping  environment  can  be  of  enormous  benefit  in  designing  and  testing 
emerging systems, such as real-time systems, and in comparing and improving algorithms that are applicable
to many different systems. One such benefit is that the software to be used in an actual system can
be  developed  using  the  environment.  The  prototyping  environment  can  support  a  simulated  environment, 
actual  hardware,  or  a  “hybrid”  mode  in  which  some  of  the  modules  are  implemented  in  hardware  and 
some  are  simulated.  In  this  way,  it  is  irrelevant  to  the  software  developer  using  the  environment  whether 
or  not  all  or  part  of  the  software  is  running  on  hardware.  When  the  system  is  running  in  a  hybrid  mode, 
the  virtual  clock  used  for  performance  measurement  is  updated  by  the  actual  time  used  for  direct  execu¬ 
tion,  making  performance  measurements  correct. 

3.  Structure  of  the  Prototyping  Environment 

The prototyping environment is designed to facilitate easy extensions and modifications. Server
processes can be created and relocated, and new implementations of server processes can be dynamically
substituted. The environment efficiently supports a spectrum of real-time database functions at the operating system
level,  and  facilitates  the  construction  of  multiple  database  systems  with  different  characteristics.  For 
experimentation,  system  functionality  can  be  adjusted  according  to  application-dependent  requirements 
without  much  overhead  for  new  system  setup.  Since  one  of  the  design  goals  of  the  prototyping  environ¬ 
ment  is  to  conduct  an  empirical  evaluation  of  the  design  and  implementation  of  real-time  distributed  data¬ 
base  systems,  it  has  built-in  support  for  performance  measurement  of  both  elapsed  time  and  blocked  time 
for each transaction.

The prototyping environment provides support for transaction processing, including transparency to
concurrent access, data distribution, and atomicity. An instance of the prototyping environment can
manage any number of virtual sites specified by the user. Modules that implement transaction processing
are decomposed into several server processes, which communicate among themselves through ports.
The  clean  interface  between  server  processes  simplifies  incorporating  new  algorithms  and  facilities  into 
the  prototyping  environment,  or  testing  alternate  implementations  of  algorithms.  To  permit  concurrent 
transactions  on  a  single  site,  there  is  a  separate  process  for  each  transaction  that  coordinates  with  other 
server  processes. 

Figure  1  illustrates  the  structure  of  the  prototyping  environment.  The  prototyping  environment  is 
based  on  a  concurrent  programming  kernel,  called  the  StarLite  kernel.  The  StarLite  kernel  supports  pro¬ 
cess  control  to  create,  ready,  block,  and  terminate  processes.  It  also  supports  the  semaphore  abstraction  to 
be  used  by  higher  level  modules  in  resource  control,  critical  section  implementation,  and  synchronous 
message-passing. The internal structure of the kernel follows the well-known client-server model
[Tan87], in which most of the operating system operates as server processes in the same address space as
client  processes,  with  the  kernel  merely  handling  message  communication  between  various  processes. 
Figure  2  shows  an  instance  of  this  model.  This  structure  is  particularly  useful  for  extensible  systems  such 
as  our  prototyping  environment,  since  additional  or  alternative  functionality  can  easily  be  provided  by 
creating  a  new  server,  instead  of  changing  and  recompiling  the  kernel. 

The scheduler in the kernel maintains a virtual clock and provides the hold primitive to control the passage
of time. The benefit of a virtual clock is that any number of performance monitoring operations may
be  performed  at  an  instant  of  virtual  time.  If  a  physical  clock  were  embedded,  the  monitoring  activities 
themselves  would  interfere  with  other  system  activities  and  add  to  the  execution  time,  resulting  in 
incorrect  performance  measures. 
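
As an illustration only, the following Python sketch shows one way a scheduler with a virtual clock and a hold-style primitive could work. It is not the StarLite kernel: the names Scheduler, spawn, and worker are hypothetical, processes are modeled as Python generators, and a process "holds" by yielding the number of time units it wants to let pass.

```python
import heapq
from itertools import count


class Scheduler:
    """Toy scheduler that keeps a virtual clock and an ordered event list."""

    def __init__(self):
        self.now = 0.0        # virtual clock, in simulation time units
        self._events = []     # heap of (wake_time, sequence_number, process)
        self._seq = count()   # tie-breaker so the heap never compares generators

    def spawn(self, process):
        """Make a process (a generator) ready to run at the current virtual time."""
        heapq.heappush(self._events, (self.now, next(self._seq), process))

    def run(self):
        """Resume processes in wake-up order until none remain."""
        while self._events:
            wake_time, _, process = heapq.heappop(self._events)
            self.now = wake_time          # the clock jumps straight to the next event
            try:
                delay = next(process)     # run the process until it holds again
            except StopIteration:
                continue                  # the process has terminated
            heapq.heappush(self._events, (self.now + delay, next(self._seq), process))


def worker(name, sched, service_time, log):
    """A process that holds for its service time and logs its start and end."""
    log.append((sched.now, name, "start"))   # logging consumes no virtual time
    yield service_time                        # plays the role of hold(service_time)
    log.append((sched.now, name, "done"))


sched = Scheduler()
log = []
sched.spawn(worker("A", sched, 5.0, log))
sched.spawn(worker("B", sched, 2.0, log))
sched.run()
```

Because the clock only moves when a process holds, appending to the log (the monitoring activity in this sketch) never changes the measured times, which is the property the virtual clock is meant to provide.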

The kernel also provides the capability of isolating the overhead imposed by each system component.
For instance, total time at each node can be divided into CPU time and I/O time, to determine the
computation-intensive and I/O-intensive functions and investigate the distribution of tasks around the system
so as to maximize parallelism.


The User Interface (UI) is a front-end invoked when the prototyping environment starts. The UI is
menu-driven, and designed to be flexible in allowing users to experiment with various configurations using
different system parameters. A user can specify the following:

• system configuration: number of sites and the number of server processes at each site, topology and
communication costs.

• database configuration: database at each site with user defined structure, size, granularity, and levels of
replication.

• load characteristics: number of transactions to be executed, size of their read-sets and write-sets, transaction
types (read-only or update) and their priorities, and the mean interarrival time of transactions.

• concurrency control: locking, timestamp ordering, and priority-based.

Fig. 1. Structure of the prototyping environment

The UI initiates the Configuration Manager (CM), which initializes the necessary data structures for
transaction processing based on the user specification. The database at each site consists of a different number
of files, and each file consists of a different number of records. The database structure can be made more
complicated if necessary. However, we use simple file access, since investigating synchronization problems
does not require complex database structures.

The CM invokes the Transaction Generator at appropriate time intervals to generate the next transaction,
so that transaction arrivals form a Poisson process. The environment is flexible enough to generate
any number of transactions with different characteristics. The user can specify their own procedure for
transactions. At initialization time, the user-specified procedure is converted into a transaction process.
Furthermore, the prototyping environment allows mixing system-generated transactions
with user-specified ones. Such a capability is very desirable, since the user can set up
any workload that represents the situation to be simulated, with or without a system-generated background
workload.
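
A Poisson arrival process can be produced by drawing exponentially distributed gaps between successive arrivals. The sketch below is a minimal Python illustration of that idea, not the Transaction Generator itself; the function name arrival_times and its parameters are assumptions made for the example.

```python
import random


def arrival_times(mean_interarrival, n, seed=0):
    """Return n arrival times whose gaps are exponentially distributed.

    Exponential gaps with mean `mean_interarrival` give a Poisson arrival
    process with rate 1 / mean_interarrival.
    """
    rng = random.Random(seed)    # fixed seed so that a run can be repeated
    now = 0.0
    times = []
    for _ in range(n):
        now += rng.expovariate(1.0 / mean_interarrival)
        times.append(now)
    return times


times = arrival_times(mean_interarrival=10.0, n=10000)
mean_gap = times[-1] / len(times)    # should be close to the requested mean
```

Seeding the generator keeps experiments repeatable, which is useful when the same workload must be replayed against different concurrency control algorithms.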

A transaction is distinguished from the other processes in the system by its behavior. To the system,
the only distinction between transactions and server processes is the PortTags on which each receives
messages. When a transaction is generated, it is assigned an identifier that is unique among all transactions
in the system. Each transaction is also assigned a globally unique timestamp which is hidden within
a single module. The advantage of separating the definition and assignment of the timestamp from its use
is that it provides a means of uniquely assigning timestamps that is independent of any specific
implementation.

Fig. 2. Client-server model (labels: Machine 1, Machine 2, Machine 3; client to server)

The timestamp assignment is closely related to the clocks in the system. In a sequential simulation,
a single clock suffices to order events in the system. An event is taken off the event queue, and the global
clock is advanced to the time required for the event to occur. Events are related in time by their relation to
the global clock. In prototyping distributed environments, no such global clock is available. Time is
referred to by local clocks, which are maintained at each site and visible only to processes at that site.
Ordering of events in terms of the global time, therefore, depends on the proper synchronization of local
clocks. In our environment, clocks are synchronized by intersite communication. An intersite message
includes the clock value of the sender site at the time the message is sent. If the sum of this clock value
and the propagation delay between the sites is greater than the clock value at the receiver site, the receiver
increments its clock by the difference between the sum and its clock value. In this way, all succeeding
events at the receiver site can be said to occur after the sending of the message. This satisfies our intuitive
notion of the "happens before" relationship [Lam78].
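
The clock adjustment rule above is small enough to state as code. The following Python function is a sketch of that rule only; the function and parameter names are hypothetical.

```python
def receive_clock(receiver_clock, sender_clock, propagation_delay):
    """Return the receiver's clock after an intersite message arrives.

    If the sender's clock plus the propagation delay is ahead of the
    receiver's clock, the receiver advances by the difference; otherwise
    its clock is left unchanged.
    """
    arrival = sender_clock + propagation_delay
    if arrival > receiver_clock:
        receiver_clock += arrival - receiver_clock
    return receiver_clock


ahead = receive_clock(receiver_clock=10, sender_clock=12, propagation_delay=3)
unchanged = receive_clock(receiver_clock=20, sender_clock=12, propagation_delay=3)
```

In the first call the receiver moves from 10 to 15; in the second its clock is already past the arrival time, so it stays at 20.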

Transaction execution consists of read and write operations. Each read or write operation is preceded
by an access request sent to the Resource Manager, which maintains the local database at each site.
Each transaction is assigned to a Transaction Manager (TM). The TM issues service requests on behalf
of the transaction and reacts appropriately to the request replies. For instance, if a transaction requests
access to a file and that file is locked, the TM executes either a blocking operation, to wait until the data object
can be accessed, or an abort procedure, depending on the situation. If granting access to a resource would
produce deadlock, the TM receives an abort response and aborts the transaction. Transactions commit in two
phases. The first commit phase consists of at least one round of messages to determine if the transaction
can be globally committed. Additional rounds may be used to handle potential failures. The second commit
phase causes the data objects to be written to the database for successful transactions. The TM executes
the two commit phases to ensure that a transaction commits or aborts globally. Figure 3 illustrates the
queueing model adopted for transaction processing.
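
The two commit phases can be sketched as follows. This is a simplified Python illustration with a single voting round and no failure handling; the Participant class and the two_phase_commit function are hypothetical names, not part of the prototyping environment.

```python
class Participant:
    """One site taking part in a commit; `can_commit` is its vote in phase one."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.database = {}   # committed data objects at this site
        self.pending = {}    # writes buffered until the outcome is known

    def prepare(self, writes):
        """Phase one: buffer the writes and vote on the global outcome."""
        if self.can_commit:
            self.pending = dict(writes)
        return self.can_commit

    def commit(self):
        """Phase two, success: write the buffered data objects to the database."""
        self.database.update(self.pending)
        self.pending = {}

    def abort(self):
        """Phase two, failure: discard the buffered writes."""
        self.pending = {}


def two_phase_commit(participants, writes_by_site):
    """Commit at every site or at none, depending on the phase-one votes."""
    votes = [p.prepare(writes_by_site.get(p.name, {})) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


a, b = Participant("A"), Participant("B")
outcome_ok = two_phase_commit([a, b], {"A": {"x": 1}, "B": {"y": 2}})

c, d = Participant("C"), Participant("D", can_commit=False)
outcome_bad = two_phase_commit([c, d], {"C": {"x": 1}, "D": {"y": 2}})
```

The second run shows the global property: site C voted to commit, but because site D refused, nothing is written at either site.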

Transactions are generated and put into the start-up queue. When a transaction is started, it leaves
the start-up queue and enters the ready queue. The transaction at the top of the ready queue is selected to run.
The currently running transaction sends requests to the concurrency controller (CC) implemented in the
resource manager. The transaction may be blocked and placed in the block queue. It may also be aborted
and restarted. In such a case, it is first delayed for a certain amount of time and then put in the ready
queue again. When a transaction in the block queue is unblocked, it leaves the block queue and is placed
in the ready queue again.

In prototyping distributed database systems, a communication network is an important component
to  be  simulated,  since  the  system  performance  depends  heavily  on  the  topology  and  communication  pro¬ 
tocols  used.  However,  in  many  database  simulators,  the  communication  subsystem  is  either  ignored  or 
simplified  by  adding  communication  cost  to  the  transaction  processing  time.  Our  prototyping  environ¬ 
ment  uses  a  different  approach  by  providing  a  virtual  communication  network  that  actually  runs  a  layered 
communication  protocol  on  a  network  topology  specified  by  the  user.  Since  the  communication  module 
is  a  separate  building  block  in  the  prototyping  environment,  the  user  can  change  it  to  simulate  different 
requirements  of  the  application. 

The  Message  Server  (MS)  is  a  process  listening  on  a  well-known  port  for  messages  from  remote 
sites.  When  a  message  is  sent  to  a  remote  site,  it  is  placed  on  the  message  queue  of  the  destination  site 
and  the  sender  blocks  itself  on  a  private  semaphore  until  the  message  is  retrieved  by  MS.  If  the  receiving 
site  is  not  operational,  a  time-out  mechanism  will  unblock  the  sender  process.  When  MS  retrieves  a  mes¬ 
sage,  it  wakes  the  sender  process  and  forwards  the  message  to  the  proper  servers  or  TM.  The  prototyping 
environment supports both Ada-style rendezvous (synchronous) and asynchronous message passing.
Inter-process communication within a site does not go through the Message Server; processes send
and receive messages directly through their associated ports.

The  inter-process  communication  structure  is  designed  to  provide  a  simple  and  flexible  interface  to 
the  client  processes  of  the  application  software,  independent  from  the  low-level  hardware  configurations. 
It  is  split  into  three  levels  of  hierarchy:  transport,  network,  and  physical  layers. 

The  Transport  layer  is  the  interface  to  the  application  software,  thus  it  is  designed  to  be  as  abstract 
as  possible  in  order  to  support  different  port  structures  and  various  message  types.  In  addition,  application 
level  processes  need  not  know  the  details  of  the  destination  device.  The  invariant  built  into  the  design  of 
the inter-process communication interface is that the application level sender allocates the space for a
message, and the receiver deallocates it. Thus, it is irrelevant whether or not the sender and receiver share
memory space, i.e., whether or not the Physical layer on the sender’s side copies the message into a buffer
and deallocates it at the sender’s site, and the Physical layer at the receiver’s site allocates space for the
message.  This  enables  prototyping  distributed  systems  or  multiprocessors  with  no  shared  memory,  as 
well  as  multiprocessors  with  shared  memory  space.  When  prototyping  the  latter,  only  addresses  need  to 
be  passed  in  messages  without  intermediate  allocation  and  deallocation. 

The  Physical  layer  of  message  passing  simulates  the  physical  sending  and  receiving  of  bits  over  a 
communication medium, i.e., it is for intersite message passing. The device number in the interface is
simply a cardinal number; this enables the implementation to be simple and extensible enough to support
any application. To simulate sending or to actually send over an Ethernet in the target system, for example,
a module could map network addresses onto cardinals. To send from one processor to another in a
multiprocessor or in a distributed system, the cardinals can represent processor numbers.

Messages  are  passed  to  specific  processes  at  specific  sites  in  the  Network  layer  of  the  communica¬ 
tions  interface.  This  layer  serves  to  separate  the  Transport  and  the  Physical  layers,  so  that  the  Transport 
layer interface can be processor- and process-independent and the Physical layer interface need be concerned
only with the sending of bits from one site to another. The Transport layer interface of the communication
subsystem is implemented in the Transport module. A Transport-level Send is made to an
abstraction called a PortTag. This abstraction is advantageous because the implementation (i.e., what a
PortTag represents) is hidden in the Ports module. Thus the PortTag can be mapped onto any port structure
or the reception points of any other message passing system. The Transport-level Send operation
builds a packet consisting of the sender’s PortTag (used for replies), the destination PortTag, and the
address of the message. It then retrieves from the destination PortTag the destination device number. If
this number is the same as the sender’s, the Send is an intra-site message communication, and hence the
Network-level Send is performed. Otherwise the Send requires the Physical module for intersite communication.
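
The routing decision made by the Transport-level Send can be sketched as below. This Python fragment is an assumption-laden illustration: the PortTag fields, the Packet and Site classes, and the transport_send function are invented for the example and do not reproduce the actual Transport, Ports, or Physical modules. In particular, the intersite branch simply hands the packet to the destination site instead of simulating a medium.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PortTag:
    """Hypothetical stand-in for a PortTag: a device number plus a port number."""
    device: int   # cardinal identifying the site or processor
    port: int     # reception point within that site


@dataclass
class Packet:
    sender: PortTag        # kept so that the receiver can reply
    destination: PortTag
    message: object        # stands in for "the address of the message"


class Site:
    """One simulated site with a queue of packets per local port."""

    def __init__(self, device):
        self.device = device
        self.queues = {}

    def network_send(self, packet):
        """Network-level Send: deliver the packet to a port on this site."""
        self.queues.setdefault(packet.destination.port, deque()).append(packet)


def transport_send(sites, sender, destination, message):
    """Build the packet, then choose the layer by comparing device numbers."""
    packet = Packet(sender, destination, message)
    if destination.device == sender.device:
        sites[sender.device].network_send(packet)    # intra-site communication
        return "network"
    # Intersite case: a real Physical layer would model the medium; here the
    # packet is passed directly to the destination site's Network layer.
    sites[destination.device].network_send(packet)
    return "physical"


sites = {0: Site(0), 1: Site(1)}
client = PortTag(device=0, port=1)
local_server = PortTag(device=0, port=2)
remote_server = PortTag(device=1, port=2)
local_route = transport_send(sites, client, local_server, "read x")
remote_route = transport_send(sites, client, remote_server, "read y")
```

Only transport_send looks inside a PortTag here, which mirrors the point that callers need not know what a PortTag represents.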

Note that accesses to the implementation details of the PortTag are restricted to the module that
actually implements it; this enables changing the implementation without recompiling the rest of the system.

The Performance Monitor interacts with the transaction managers to record the priority/timestamp and
read/write data set for each transaction, the time when each event occurred, statistics for each transaction, and
the CPU hold interval at each node. The statistics for a transaction include arrival time, start time, total
processing time, blocked interval, whether the deadline was missed or not, and the number of aborts.

Since each TM is a separate process, each has its own data area in which to keep track of the time
when  a  service  request  is  sent  out  and  the  time  the  response  arrives,  as  well  as  the  time  when  a  transaction 
begins  blocking,  waiting  for  a  resource,  and  the  time  the  resource  is  granted.  When  a  transaction  com¬ 
mits,  it  calls  a  procedure  that  records  the  above  measures;  when  the  simulation  clock  has  expired,  these 
measures  are  printed  out  for  all  transactions. 

4.  Prototyping  Real-Time  Database  Systems 

The  previous  section  described  the  structure  of  the  prototyping  environment,  with  some  of  its 
advanced features. In this section, we present real-time database systems implemented using the prototyping
environment. The objectives of our study using the prototyping environment are 1) to evaluate the
prototyping environment itself in terms of correctness, functionality, and modularity, and 2) to compare
the performance of two-phase locking with that of priority-based synchronization algorithms, and that of a
multiversion database with that of its corresponding single-version database, through a sensitivity study of key
parameters that affect performance.

Compared  with  traditional  databases,  real-time  database  systems  have  a  distinct  feature:  they  must 
satisfy  the  timing  constraints  associated  with  transactions.  In  other  words,  "time"  is  one  of  the  key  factors 
to  be  considered  in  real-time  database  systems.  The  timing  constraints  of  a  transaction  typically  include 
its  ready  time  and  deadline,  as  well  as  temporal  consistency  of  the  data  accessed  by  it.  Transactions  must 
be  scheduled  in  such  a  way  that  they  can  be  completed  before  their  corresponding  deadlines  expire.  For 
example, both the update of and the query on the tracking data of a missile must be processed within the given
deadlines; otherwise, the information provided could be of little value. In such a system, transaction
processing must satisfy not only the database consistency constraints but also the timing constraints.

The  prototyping  environment  we  have  developed  is  especially  useful  for  investigating  timing 
behavior  of  real-time  transactions,  since  we  can  control  all  the  system  components.  An  alternative  to  the 
prototyping  approach  is  to  develop  a  system  on  a  bare  machine,  based  on  a  specialized  real-time  kernel. 
ARTS [Tok89] and RT-CARAT [Hua90] take this approach. The difficulties with such an approach
are that 1) it takes much more effort to develop, 2) the system is strongly coupled with its hardware, so
its timing characteristics are hard to change when needed, and 3) the system is not portable, since it is
implemented in the target environment.

In  addition  to  providing  shared  data  access  capabilities,  distributed  real-time  database  systems  offer 
a  means  of  loosely  coupling  communicating  processes,  making  it  easier  to  rapidly  update  software,  at 
least  from  a  functional  perspective.  However,  with  respect  to  time-driven  scheduling  and  system  timing 
predictability,  they  present  new  problems.  One  of  the  characteristics  of  current  database  managers  is  that 
they  do  not  schedule  their  transactions  to  meet  response  time  requirements  and  they  commonly  lock  data 
tables to assure database consistency. Locks and time-driven scheduling are basically incompatible. Low
priority transactions holding locks required by higher priority transactions can and will block the higher
priority transactions, leading to response requirement failures. New techniques that are compatible with
time-driven scheduling are required to manage data consistency.

4.1.  Steady  State  Estimation 

In order to show that the results we get from experiments represent the performance of the system in
steady state, we have performed experiments to check that, if the system were allowed to run for any length of
time greater than a certain threshold value, the variation in results would be within some tolerable interval.
We have implemented a well-known synchronization protocol, two-phase locking (2PL), for the following
system and workload configuration:

8 sites with fully interconnected network
multiprogramming level of 10
75% read-only and 25% update transactions
read-only transactions access 3% of the database
update transactions access 1% of the database
database consists of 500 unreplicated objects
Poisson distribution of transaction arrivals

Figure 4 shows the average response time of transactions using 2PL. It shows that the average
response time begins to stabilize at 3000 simulation time units, and varies only slightly from then on. The
lower response times up to 3000 time units are due to the first set of transactions, which benefit from a lower
initial multiprogramming level and fewer potential conflicts. In addition, since transactions requiring longer
execution time increase the average response time only when they complete, they do not contribute to the
average response time during the early stage of transaction execution if they were in the initial group of
transactions. These initial characteristics are gradually erased from the average performance.
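
The start-up effect described above can be reproduced with a running average over completed transactions. The Python sketch below is illustrative only: the function names are hypothetical, and the completion records are made-up numbers chosen to show the effect, not measurements from our experiments.

```python
def average_response_by(checkpoint, completions):
    """Average response time of the transactions completed by `checkpoint`.

    `completions` is a list of (completion_time, response_time) pairs.
    Returns None if no transaction has completed yet.
    """
    done = [resp for finish, resp in completions if finish <= checkpoint]
    return sum(done) / len(done) if done else None


def is_stable(averages, tolerance):
    """True if consecutive averages differ by at most `tolerance`."""
    return all(abs(b - a) <= tolerance for a, b in zip(averages, averages[1:]))


# Made-up records: three short transactions finish early, one long
# transaction from the initial group finishes at time 100.
completions = [(10, 10), (15, 15), (20, 20), (100, 100), (120, 40), (140, 40)]

early = average_response_by(20, completions)     # long transaction not yet counted
middle = average_response_by(100, completions)   # long transaction now included
late = average_response_by(140, completions)
```

The early average is low because the long transaction has not completed and is therefore missing from the mean; once it completes, the average rises and then changes only slightly.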

In addition, as we increase the time for experiments, the average response time is determined from
an increasing number of transactions. For example, at 100 time units, the number of transactions contributing
to the mean is approximately 12. At 4000, it is approximately 60. Thus the overall behavior of the
system becomes less and less subject to the behavior of individual transactions. From the graph and
characteristics of our environment, we concluded that an experiment must run for at least 3500 time units
before it starts to capture the steady state behavior of the system.

Fig. 4. Response time stability

4.2.  Priority-Based  Synchronization 

Real-time databases are often used by applications such as tracking. Tasks in such applications consist
of both computing (signal processing) and database accessing (transactions). A task can have multiple
transactions, each of which consists of a sequence of read and write operations operating on the database.
Each  transaction  will  follow  the  two-phase  locking  protocol,  which  requires  a  transaction  to  acquire  all 
the  locks  before  it  releases  any  lock.  Once  a  transaction  releases  a  lock,  it  cannot  acquire  any  new  lock.  A 
high  priority  task  will  preempt  the  execution  of  lower  priority  tasks  unless  it  is  blocked  by  the  locking 
protocol  at  the  database. 
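
The two-phase rule (no lock may be acquired after any lock has been released) can be enforced with a single flag. The following Python class is a minimal sketch under that reading; the class name and methods are hypothetical, and lock conflicts between transactions are deliberately left out.

```python
class TwoPhaseTransaction:
    """Tracks the locks of one transaction and enforces the two-phase rule."""

    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False   # becomes True once any lock is released

    def acquire(self, obj):
        """Growing phase: allowed only before the first release."""
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot lock {obj!r} after releasing a lock")
        self.held.add(obj)

    def release(self, obj):
        """Shrinking phase: the first release ends the growing phase."""
        self.held.remove(obj)
        self.shrinking = True


txn = TwoPhaseTransaction("T1")
txn.acquire("x")
txn.acquire("y")
txn.release("x")
try:
    txn.acquire("z")          # violates the protocol
    violated = False
except RuntimeError:
    violated = True
```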

In  a  real-time  database  system,  synchronization  protocols  must  not  only  maintain  the  consistency 
constraints of the database but also satisfy the timing requirements of the transactions accessing the database.
To satisfy both the consistency and real-time constraints, there is a need to integrate synchronization
protocols with real-time priority scheduling protocols. A major source of problems in integrating the
two protocols is the lack of coordination in the development of synchronization protocols and real-time
priority scheduling protocols. Due to the effect of blocking in lock-based synchronization protocols, a
direct application of a real-time scheduling algorithm to transactions may result in a condition known as
priority inversion [Raj89]. Priority inversion is said to occur when a higher priority process is forced to
wait  for  the  execution  of  a  lower  priority  process  for  an  indefinite  period  of  time.  When  the  transactions 
of  two  processes  attempt  to  access  the  same  data  object,  the  access  must  be  serialized  to  maintain  con¬ 
sistency.  If  the  transaction  of  the  higher  priority  process  gains  access  first,  then  the  proper  priority  order  is 
maintained;  however,  if  the  transaction  of  the  lower  priority  gains  access  first  and  then  the  higher  priority 
transaction  requests  access  to  the  data  object,  this  higher  priority  process  will  be  blocked  until  the  lower 
priority  transaction  completes  its  access  to  the  data  object.  Priority  inversion  is  inevitable  in  transaction 
systems.  However,  to  achieve  a  high  degree  of  schedulability  in  real-time  applications,  priority  inversion 
must be minimized. This is illustrated by the following example.

Example: Suppose T1, T2, and T3 are three transactions arranged in descending order of priority
with T1 having the highest priority. Assume that T1 and T3 access the same data object O1. Suppose that
at time t1 transaction T3 obtains a lock on O1. During the execution of T3, the high priority transaction T1
arrives, preempts T3 and later attempts to access the object O1. Transaction T1 will be blocked, since O1 is
already locked. We would expect that T1, being the highest priority transaction, will be blocked no longer
than the time for transaction T3 to complete and unlock O1. However, the duration of blocking may, in
fact, be unpredictable. This is because transaction T3 can be blocked by the intermediate priority transaction
T2 that does not need to access O1. The blocking of T3, and hence that of T1, will continue until T2
and any other pending intermediate priority level transactions are completed.
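
The example above can be replayed with a small tick-based simulation. The Python sketch below makes several simplifying assumptions that are not in the example: there is one CPU and one lock, a transaction that uses O1 holds the lock for its whole execution, T1 requests the lock as soon as it arrives, and all arrival times and execution lengths are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    name: str
    priority: int       # smaller number means higher priority
    arrival: int        # tick at which the transaction arrives
    work: int           # ticks of CPU time it needs
    needs_lock: bool    # True if it locks the shared object for its whole run
    finish: int = -1    # filled in by the simulation


def simulate(txns):
    """Preemptive priority scheduling on one CPU with one exclusive lock."""
    remaining = {t.name: t.work for t in txns}
    holder = None       # name of the transaction holding the lock, if any
    tick = 0
    while any(r > 0 for r in remaining.values()):
        ready = [
            t for t in txns
            if t.arrival <= tick and remaining[t.name] > 0
            and not (t.needs_lock and holder not in (None, t.name))
        ]
        if ready:
            current = min(ready, key=lambda t: t.priority)
            if current.needs_lock:
                holder = current.name        # lock is taken (or already held)
            remaining[current.name] -= 1
            if remaining[current.name] == 0:
                current.finish = tick + 1
                if holder == current.name:
                    holder = None            # unlock on completion
        tick += 1
    return {t.name: t.finish for t in txns}


def workload(with_t2):
    txns = [Txn("T1", 1, arrival=1, work=2, needs_lock=True),
            Txn("T3", 3, arrival=0, work=4, needs_lock=True)]
    if with_t2:
        txns.append(Txn("T2", 2, arrival=2, work=5, needs_lock=False))
    return txns


without_t2 = simulate(workload(with_t2=False))
with_t2 = simulate(workload(with_t2=True))
```

Without T2, T1 finishes at tick 6, having waited only for T3. With T2 present, T1 finishes at tick 11: the five extra ticks are exactly T2's execution time, even though T2 never touches O1.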

The  blocking  duration  in  the  example  above  can  be  arbitrarily  long.  This  situation  can  be  partially 
remedied  if  transactions  are  not  allowed  to  be  preempted;  however,  this  solution  is  only  appropriate  for 
very  short  transactions,  because  it  creates  unnecessary  blocking.  For  instance,  once  a  long  low  priority 
transaction  starts  execution,  a  high  priority  transaction  not  requiring  access  to  the  same  set  of  data  objects 
may  be  needlessly  blocked. 

An approach to this problem, based on the notion of priority inheritance, has been proposed
[Sha87]. The basic idea of priority inheritance is that when a transaction T of a process blocks higher
priority processes, it executes at the highest priority of all the transactions blocked by T. This simple idea
of priority inheritance reduces the blocking time of a higher priority transaction. However, this is inadequate
because the blocking duration for a transaction, though bounded, can still be substantial due to the
potential chain of blocking. For instance, suppose that transaction T1 needs to sequentially access objects
O1 and O2. Also suppose that T2 preempts T3, which has already locked O2. Then T2 locks O1. Transaction
T1 arrives at this instant and finds that the objects O1 and O2 have been respectively locked by the
lower priority transactions T2 and T3. As a result, T1 would be blocked for the duration of two transactions,
once to wait for T2 to release O1 and again to wait for T3 to release O2. Thus a chain of blocking
can be formed.

One idea for dealing with this inadequacy is to use a total priority ordering of active transactions
[Sha88].  A  transaction  is  said  to  be  active  if  it  has  started  but  not  yet  completed  its  execution.  A  transac¬ 
tion  can  be  active  in  one  of  two  states:  executing  or  being  preempted  in  the  middle  of  its  execution.  The 
idea  of  total  priority  ordering  is  that  the  real-time  locking  protocol  ensures  that  each  active  transaction  is 
executed  at  some  priority  level,  taking  priority  inheritance  and  read/write  semantics  into  consideration. 
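
As a rough illustration of the basic priority inheritance rule described above, the following Python
sketch computes the priority at which a blocking transaction executes. The function name, the dictionary
layout, and the convention that a larger number means a higher priority are assumptions made for this
example only; they do not come from the protocol description.

```python
def inherited_priority(base, blocked_by, holder):
    """Priority at which `holder` executes under basic priority inheritance.

    base:       dict mapping transaction name -> assigned priority
                (assumption: a larger number means a higher priority)
    blocked_by: dict mapping each blocked transaction -> the transaction blocking it
    """
    # Transactions that are currently waiting for `holder`.
    waiting = [t for t, h in blocked_by.items() if h == holder]
    # The holder runs at the highest priority among itself and those it blocks.
    return max([base[holder]] + [base[t] for t in waiting])


base = {"T1": 3, "T2": 2, "T3": 1}
# T1 waits for a lock held by T3, so T3 runs at T1's priority and can no
# longer be preempted by the intermediate priority transaction T2.
print(inherited_priority(base, {"T1": "T3"}, "T3"))
```

Inheritance alone does not remove the chain of blocking: T1 can still wait first for T2 and then for T3,
which is the inadequacy that total priority ordering addresses.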

4.3. Total Ordering by Priority Ceiling

To ensure the total priority ordering of active transactions, three priority ceilings are defined for
each data object in the database: the write-priority ceiling, the absolute-priority ceiling, and the
rw-priority ceiling. The write-priority ceiling of a data object is defined as the priority of the highest
priority transaction that may write into this object, and the absolute-priority ceiling is defined as the
priority of the highest priority transaction that may read or write the data object. The rw-priority ceiling
is set dynamically. When a data object is write-locked, the rw-priority ceiling of this data object is
defined to be equal to the absolute-priority ceiling. When it is read-locked, the rw-priority ceiling of this
data object is defined to be equal to the write-priority ceiling.
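
The three ceilings defined above can be sketched in Python as follows. This is only an illustration under
stated assumptions: priorities are numbers where larger means higher, the lock mode is encoded as the
strings "read" and "write", and an object that no transaction may write is given the lowest possible
write-priority ceiling. None of these encodings come from the protocol itself.

```python
def static_ceilings(priorities, readers, writers):
    """Return (write_ceiling, absolute_ceiling) for one data object.

    priorities: dict mapping transaction name -> priority (larger = higher, assumed)
    readers, writers: sets of transactions that may read / write the object
    """
    lowest = float("-inf")  # assumption: ceiling used when no transaction qualifies
    write_ceiling = max((priorities[t] for t in writers), default=lowest)
    absolute_ceiling = max((priorities[t] for t in readers | writers), default=lowest)
    return write_ceiling, absolute_ceiling


def rw_priority_ceiling(lock_mode, write_ceiling, absolute_ceiling):
    """Dynamic rw-priority ceiling of a data object that is currently locked."""
    if lock_mode == "write":
        return absolute_ceiling  # write-locked: exclude every reader and writer
    if lock_mode == "read":
        return write_ceiling     # read-locked: exclude writers only
    raise ValueError("the rw-priority ceiling is defined only for locked objects")


priorities = {"T1": 3, "T2": 2, "T3": 1}
w, a = static_ceilings(priorities, readers={"T1", "T3"}, writers={"T2", "T3"})
print(w, a)                                # write ceiling, absolute ceiling
print(rw_priority_ceiling("write", w, a))  # ceiling while write-locked
print(rw_priority_ceiling("read", w, a))   # ceiling while read-locked
```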

The priority ceiling protocol is premised on systems with a fixed priority scheme. The protocol
consists of two mechanisms: priority inheritance and priority ceiling. With the combination of these two
mechanisms, we get the properties of freedom from deadlock and a worst case blocking of at most a
single lower priority transaction.

When a transaction attempts to lock a data object, the transaction's priority is compared with the
highest rw-priority ceiling of all data objects currently locked by other transactions. If the priority of the
transaction is not higher than that rw-priority ceiling, the access request is denied and the transaction
is blocked. In this case, the transaction is said to be blocked by the transaction that holds the lock
on the data object with the highest rw-priority ceiling. Otherwise, it is granted the lock. In the denied
case, priority inheritance is performed in order to overcome the problem of uncontrolled priority
inversion. For example, if transaction T blocks higher priority transactions, T inherits P_H, the highest
priority of the transactions blocked by T.


Under this protocol, it is not necessary to check for the possibility of read-write conflicts. For
instance, when a data object is write-locked by a transaction, the rw-priority ceiling is equal to the
priority of the highest priority transaction that can access it. Hence, the protocol will block any higher
priority transaction that may write or read it. On the other hand, when the data object is read-locked, the
rw-priority ceiling is equal to the priority of the highest priority transaction that may write it. Hence, a
transaction that attempts to write it will have a priority no higher than the rw-priority ceiling and will be
blocked. Only transactions that read it and have priority higher than the rw-priority ceiling will be
allowed to read-lock it, since read-locks are compatible. Using the priority ceiling protocol, mutual
deadlock of transactions cannot occur, and each transaction can be blocked by at most one lower priority
transaction until it completes or suspends itself. The next example shows how transactions are scheduled
under the priority ceiling protocol.

Example: Consider the same situation as in the previous example. According to the protocol, the
priority ceiling of O1 is the priority of T1. When T2 tries to access a data object, it is blocked because its
priority is not higher than the priority ceiling of O1. Therefore T1 will be blocked only once by T3 to
access O1, regardless of the number of data objects it may access.
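
The locking rule of the protocol, applied to the situation in this example, can be sketched in Python as
follows. The function name, the list-of-pairs representation of locked objects, and the numeric priorities
(larger means higher) are assumptions made for illustration. The caller would perform priority
inheritance for the returned blocker when a request is denied.

```python
def request_lock(requester, priorities, locked):
    """Decide a lock request under the priority ceiling rule.

    priorities: dict mapping transaction name -> priority (larger = higher, assumed)
    locked:     list of (holder, rw_ceiling) pairs, one per currently locked object
    Returns (granted, blocker); blocker is None when the lock is granted.
    """
    # Only objects locked by *other* transactions are considered.
    others = [(ceiling, holder) for holder, ceiling in locked if holder != requester]
    if not others:
        return True, None
    ceiling, holder = max(others)  # object with the highest rw-priority ceiling
    if priorities[requester] > ceiling:
        return True, None
    return False, holder           # denied: blocked by the holder of that object


priorities = {"T1": 3, "T2": 2, "T3": 1}
locked = [("T3", priorities["T1"])]  # T3 holds O1, whose ceiling is T1's priority

print(request_lock("T2", priorities, locked))  # T2 is denied any lock
print(request_lock("T1", priorities, locked))  # T1 is blocked once, by T3
print(request_lock("T1", priorities, []))      # after T3 unlocks O1
```

Note that the decision does not depend on which object is requested, which is why the protocol may
refuse a lock on an object that is not locked at all.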

The total priority ordering of active transactions leads to some interesting behavior. As shown in the
example above, the priority ceiling protocol may forbid a transaction from locking an unlocked data
object. At first sight, this seems to introduce unnecessary blocking. However, this can be considered as
the "insurance premium" for preventing deadlock and achieving the block-at-most-once property.

Using the prototyping environment, we have investigated issues associated with this idea of total
ordering in priority-based scheduling protocols. One of the critical issues related to the total ordering
approach is its performance compared with other design alternatives. In other words, it is important to
determine the actual cost of the "insurance premium" of the total priority ordering approach.

4.4. Performance Evaluation


Various statistics have been collected for comparing the performance of the priority-ceiling protocol
with other synchronization control algorithms. Transactions are generated with exponentially distributed
interarrival times, and the data objects updated by a transaction are chosen uniformly from the database.
A transaction has an execution profile which alternates data access requests with equal computation
requests, and some processing requirement for termination (either commit or abort). Thus the total
processing time of a transaction is directly related to the number of data objects accessed. Due to space
considerations, we do not present all our results but have selected the graphs which best illustrate the
differences in performance of the algorithms. For example, we have omitted the results of an experiment
that varied the size of the database, and thus the number of conflicts, because they only confirm, and do
not add to, the knowledge yielded by other experiments.
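
The workload model described above can be sketched in Python as follows. This is an illustrative
sketch, not the generator used in the experiments: the function names, the parameters (mean interarrival
time, per-request costs, termination cost), and the dictionary layout are all assumptions.

```python
import random


def generate_transactions(n, mean_interarrival, size, db_size, seed=0):
    """Generate n transactions with exponentially distributed interarrival times.

    Each transaction accesses `size` distinct objects chosen uniformly from a
    database of `db_size` objects.
    """
    rng = random.Random(seed)
    now = 0.0
    transactions = []
    for i in range(n):
        # expovariate takes the rate, i.e. 1 / mean interarrival time.
        now += rng.expovariate(1.0 / mean_interarrival)
        objects = rng.sample(range(db_size), size)
        transactions.append({"id": i, "arrival": now, "objects": objects})
    return transactions


def processing_time(size, access_cost, compute_cost, termination_cost):
    """Total processing time: one computation request per data access request,
    plus a fixed cost for termination (commit or abort)."""
    return size * (access_cost + compute_cost) + termination_cost


workload = generate_transactions(n=5, mean_interarrival=2.0, size=3, db_size=100)
for txn in workload:
    print(txn["id"], round(txn["arrival"], 2), txn["objects"])
print(processing_time(size=3, access_cost=1.0, compute_cost=1.0, termination_cost=0.5))
```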

For each experiment and for each algorithm tested, we collected performance statistics and averaged
them over the 10 runs. The percentage of deadline-missing transactions is calculated with the following
equation: % missed = 100 * (number of deadline-missing transactions / number of transactions
processed). A transaction is processed if either it executes completely or it is aborted. We assume that all
the transactions are hard in the sense that there is no value in completing a transaction after its deadline.
Transactions that miss their deadlines are aborted, and disappear from the system immediately with some
abort cost. We have used the transaction size (the number of data objects a transaction needs to access) as
one of the key variables in the experiments. It varies from a small fraction up to a relatively large portion
(10%) of the database so that conflicts occur frequently. The high conflict rate allows synchronization
protocols to play a significant role in the system performance. We chose the arrival rate so that protocols
are tested in a heavily loaded rather than lightly loaded system, because in designing real-time systems
one must consider high load situations. Even though they may not arise frequently, one would like to
have a system that misses as few deadlines as possible when such peaks occur. In other words, when a
crisis occurs and the database system is under pressure is precisely when making a few extra deadlines
could be most important [Abb88].

We normalize the transaction throughput in records accessed per second for successful transactions,
not in transactions per second, in order to account for the fact that bigger transactions need more database
processing. The normalized rate is obtained by multiplying the transaction completion rate
(transactions/second) by the transaction size (database records accessed/transaction).
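
The two performance measures defined above can be written out as a short Python sketch. The function
names and the sample numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def percent_missed(num_missed, num_processed):
    """% missed = 100 * (deadline-missing transactions / transactions processed).

    A transaction counts as processed if it either completed or was aborted.
    """
    return 100.0 * num_missed / num_processed


def normalized_throughput(num_completed, elapsed_seconds, transaction_size):
    """Records accessed per second by successful transactions."""
    completion_rate = num_completed / elapsed_seconds  # transactions/second
    return completion_rate * transaction_size          # records/second


# Hypothetical run: 200 transactions processed, 25 of them missed the deadline.
print(percent_missed(25, 200))
# Hypothetical run: 120 transactions of size 8 completed in 60 seconds.
print(normalized_throughput(120, 60.0, 8))
```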

In Figure 5, the throughput of the priority-ceiling protocol (C), the two-phase locking protocol with
priority mode (P), and the two-phase locking protocol without priority mode (L) is shown for
transactions of different sizes with balanced workload and I/O bound workload. The two important
factors affecting the performance of locking protocols are their abilities to resolve locking conflicts and
to perform I/O and transactions in parallel. When the transaction size is small, there is little locking
conflict, and problems such as deadlock and priority inversion have little effect on the overall
performance of a locking protocol. On the other hand, when transaction size becomes large, the
probability of locking conflicts rises rapidly. In fact, the probability of deadlocks goes up with the fourth
power of the transaction size [Gray81]. Hence, we would expect that the performance of protocols will be
dominated by their abilities to handle locking conflicts when transaction size is large.

As illustrated in Figure 5, the performance of the two-phase locking protocol, with or without
priority assignments to transactions, degrades very fast when transaction size increases. This can be
attributed to the inability of this protocol to prevent deadlock and priority inversions. On the other hand,
the priority ceiling protocol handles locking conflicts very well. The protocol performs much better than
the two-phase locking protocol when the transaction size is large. The main weakness of the priority
ceiling protocol is its inability to perform I/O and transactions in parallel. For example, suppose that
transaction T has a lock on O1 and now wants to lock data object O2. Unfortunately, O2 is not in the main
memory. As a result, T is suspended. However, transactions with priorities lower than the rw-priority
ceiling of O1 are not allowed to execute either. This could lead to the idling of the processor until either
O2 is transferred to the main memory or a transaction whose priority is higher than the rw-priority
ceiling arrives. We refer to this type of blocking as I/O blocking. When transaction size is small, the
locking conflict rate is small. Hence, the two-phase locking protocol performs well. However, due to I/O
blocking, the throughput of the priority ceiling protocol is not as good as that of the two-phase locking
protocol, especially when the workload is I/O bound.

Since I/O cost is one of the key parameters in determining performance, we have investigated an
approach to improve system performance by performing I/O operations before locking. This is called
intention I/O. In the intention mode of I/O operation, the system pre-fetches data objects that are in the
access lists of submitted transactions, without locking them. This approach reduces the time for which
data objects are locked, resulting in higher throughput. As shown in Figure 6, intention I/O improves the
throughput of both the two-phase locking protocol and the priority ceiling protocol. However, the
improvement in the priority ceiling protocol is much more significant. This is because intention I/O
effectively solves the I/O blocking problem of the priority ceiling protocol.

Another important performance statistic is the percentage of deadline-missing transactions, since
the synchronization protocol in real-time database systems must satisfy the timing constraints of
individual transactions. In our experiments, each transaction's deadline is set proportional to its size and
system workload (number of transactions), and the transaction with the earliest deadline is assigned the
highest priority. As shown in Figure 7, the percentage of deadline-missing transactions increases sharply
for the two-phase locking protocol as the transaction size increases, due to its inability to deal with
deadlock and to give preference to transactions with shorter deadlines. Two-phase locking with priority
assignment performs somewhat better, because the timing constraints of transactions are considered,
although the deadlock and priority inversion problems still handicap its performance. The priority ceiling
protocol has the best relative performance because it addresses both the deadlock and priority inversion
problems.
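
The deadline and priority assignment used in these experiments can be sketched in Python as follows.
The text states only that the deadline is proportional to transaction size and system workload; the
product form, the `slack_factor` constant, and measuring the deadline from the arrival time are
assumptions made for this illustration.

```python
def assign_deadline(arrival, size, num_transactions, slack_factor):
    """Deadline proportional to transaction size and to the system workload.

    Assumption: the proportional term is added to the arrival time, and
    `slack_factor` is a hypothetical tuning constant.
    """
    return arrival + slack_factor * size * num_transactions


def priority_order(deadlines):
    """Transaction ids from highest to lowest priority (earliest deadline first)."""
    return sorted(deadlines, key=deadlines.get)


deadlines = {
    "A": assign_deadline(arrival=0.0, size=4, num_transactions=3, slack_factor=2.0),
    "B": assign_deadline(arrival=5.0, size=2, num_transactions=3, slack_factor=2.0),
}
print(deadlines)
print(priority_order(deadlines))  # B has the earlier deadline, so it ranks first
```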

A  drawback  of  the  priority  ceiling  protocol  from  the  practical  viewpoint  is  that  it  needs  knowledge 
of  all  transactions  that  will  be  executed  in  the  future.  This  may  be  a  very  strong  requirement  to  satisfy  in 
some  applications. 

The priority ceiling protocol takes a conservative approach. It is based on two-phase locking and
employs only blocking, but not roll-back, to resolve conflicts. For conventional database systems, it has
been shown that optimal performance may be achieved by a compromise between blocking and roll-back
[Yu90]. For real-time database systems, we may expect similar results. Aborting a few low priority
transactions and restarting them later may allow high priority transactions to meet their deadlines,
resulting in improved system performance. Several concurrency control protocols based on the optimistic
approach have been proposed [Har90, Lin90, Son90]. They incorporate priority-based conflict resolution
mechanisms, such as priority wait, that make low priority transactions wait for conflicting high priority
transactions to complete. However, this approach of detecting conflicts during the validation phase
degrades system predictability. A transaction is detected as being late only when it actually misses its
deadline, since the transaction is only aborted in the validation phase.

4.5.  A  Multiversion  Database  System 

To illustrate the effectiveness of the prototyping environment, we have investigated the performance
of a multiversion database system. There is no correlation between the priority ceiling protocol study and
the multiversion database study.

In a multiversion database system, each data object consists of a number of consecutive versions.
The objective of using multiple versions in real-time database systems is to increase the degree of
concurrency and to reduce the possibility of rejecting user requests by providing a succession of views of
data objects. One of the reasons for rejecting a user request is that its operations cannot be serviced by the
system. For example, a read operation has to be rejected if the value of the data object it was supposed to
read has already been overwritten by some other user request. Such rejections can be avoided by keeping
old versions of each data object so that an appropriate old value can be given to a tardy read operation. In
a system with multiple versions of data, each write operation on a data object produces a new version
instead of overwriting it. Hence, for each read operation, the system selects an appropriate version to
read, enjoying flexibility in controlling the order of read and write operations. When a new version is
created, it is uncertified. Uncertified versions are prohibited from being read by other transactions, to
guarantee freedom from cascaded aborts [Bern87]. A version is certified at the commit time of the
transaction that generated the version.


The multiversion database system we have implemented is based on timestamp ordering. Each data
object is represented as a list of versions, and each version is associated with timestamps for its creation
and its latest read, and a valid bit to specify whether the version is certified. The multiversion
concurrency control scheme we have implemented is called the "multiversion timestamp ordering
method", and has been proved to satisfy serializability [Bern87].

Each transaction consists of read and write requests for data objects. Read requests are never
rejected in a multiversion database system if all the versions are retained. A read operation does not
necessarily read the latest committed version of a data object. A read request is transformed to a
version-read operation by selecting an appropriate version to read. The timestamp of a read request is
compared with the write-timestamp of the highest available version. When a read request with timestamp
T is sent to the Resource Manager, the version of a data object with the largest timestamp less than T is
selected as the value to be returned. Figure 8 shows an example of a read operation with the timestamp
"11".
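
The version selection rule for reads can be sketched in Python as follows. The function name, the
list-of-pairs representation, and the sample timestamps are assumptions for illustration; they are not
taken from the Resource Manager or from Figure 8. The sketch assumes the list contains only versions
that may be read (that is, certified ones) and is sorted by write-timestamp.

```python
from bisect import bisect_left


def select_version(versions, read_ts):
    """Return the version with the largest write-timestamp less than read_ts.

    versions: list of (write_timestamp, value) pairs sorted by timestamp
    Returns None if every retained version is too new for this reader.
    """
    stamps = [ts for ts, _ in versions]
    # bisect_left gives the number of versions with timestamp < read_ts.
    i = bisect_left(stamps, read_ts)
    if i == 0:
        return None
    return versions[i - 1]


versions = [(5, "old value"), (10, "new value")]  # hypothetical certified versions
print(select_version(versions, 11))  # reads the version written at 10
print(select_version(versions, 8))   # a tardy read still finds the version at 5
```

A full implementation would also update the read-timestamp of the selected version, since write requests
are checked against it.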

The  timestamp  of  a  write  request  is  compared  with  the  read  timestamp  of  the  highest  version  of  the 
data  object.  A  new  version  with  the  timestamp  greater  than  the  read-timestamp  of  the  highest  certified 
version  is  built  on  the  upper  level,  with  the  valid  bit  reset  to  indicate  that  the  new  version  is  not  certified 
yet.  In  order  to  simplify  the  concurrency  control  mechanism,  we  allow  only  one  temporary  version  for 
each  data  object.  Inserting  a  new  version  in  the  middle  of  existing  valid  versions  is  not  allowed. 
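
The write rule described above can be sketched in Python as follows. The class and function names are
hypothetical, and two details are assumptions rather than statements from the text: a write that fails
the check is rejected (returns False), and a write that arrives while an uncertified version exists is
also rejected instead of being queued.

```python
from dataclasses import dataclass


@dataclass
class Version:
    write_ts: int            # creation timestamp
    read_ts: int             # timestamp of the latest read of this version
    value: object
    certified: bool = False  # the "valid bit"


def write_request(versions, ts, value):
    """Try to build a new uncertified version on top of a non-empty version list."""
    top = versions[-1]
    if not top.certified:
        return False  # only one temporary version per data object is allowed
    if ts <= top.read_ts:
        return False  # not newer than the latest read of the highest version
    # read_ts starts at write_ts, so the check above also prevents inserting
    # a version in the middle of existing valid versions.
    versions.append(Version(write_ts=ts, read_ts=ts, value=value))
    return True


def certify(versions):
    """Set the valid bit when the writing transaction commits."""
    versions[-1].certified = True


versions = [Version(5, 5, "a", True), Version(10, 11, "b", True)]
print(write_request(versions, 9, "x"))   # rejected: version 10 was read at 11
print(write_request(versions, 12, "c"))  # accepted as an uncertified version
print(write_request(versions, 13, "d"))  # rejected: a temporary version exists
certify(versions)
print(write_request(versions, 13, "d"))  # accepted after certification
```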


Fig. 8. A read operation with two certified versions of a data object.

The experiment was conducted to measure the average response time and the number of aborts for a
group of transactions running on a multiversion database system and its corresponding single-version
system. Two groups of transactions with different characteristics (e.g., type and number of accesses to
data objects) were executed concurrently. The objective was to study the sensitivity of those two
performance measures to key parameters. Here we present our findings briefly.

Performance is highly dependent on the set size of transactions. As shown in Figure 9, a
multiversion database system outperforms the corresponding single-version system for the type of
workload under which it is expected to be beneficial: a mix of small update transactions and larger
read-only transactions. The reason for this is that, in a multiversion database system, read requests have
higher priority than write requests, whereas no such priority for read requests is provided in a
single-version system. Therefore, in a single-version system, the probability of rejecting a read request is
equal to that of a write request. The experiment shows that a single-version database system outperforms
its multiversion counterpart for a different transaction mix.

It was observed that the performance of a multiversion system in terms of the number of aborts is
better than that of its single-version counterpart for a mix of small update transactions and larger
read-only transactions. Similar experiments have been performed by changing the database size and the
mean interarrival time of transactions. It was found, however, that the main result remains the same.
From these experiments, it becomes clear that among the four variables we studied, the set size of
transactions is the parameter to which the performance of a multiversion database system is most
sensitive. This experiment demonstrates the expressive power and performance evaluation capability of
the prototyping environment.

5.  Conclusions 

Prototyping large software systems is not a new approach. However, methodologies for developing
a prototyping environment for real-time database systems have not been investigated in depth in spite of
their potential benefits. In this paper, we have presented a prototyping environment that has been
developed based on the StarLite concurrent programming kernel and a message-based approach with
modular building blocks. Although the complexity of a distributed database system makes prototyping
difficult, the implementation has proven satisfactory for experimentation with design choices, different
database control techniques, and even an integrated evaluation of database systems.

There  are  three  main  goals  to  be  achieved  in  developing  a  prototyping  environment  for  real-time 
database  systems:  modularity,  flexibility,  and  extensibility.  Modularity  enables  the  environment  to  be 
easily  reconfigured,  since  any  subset  of  the  available  modules  can  be  combined  to  produce  a  new  testing 
environment. 

An  additional  benefit  of  the  "right"  modularity  is  that  actual  system  software  can  be  developed  in 
the  prototyping  environment  and  then  ported  to  the  target  machine.  This  is  enabled  by  the  use  of 
technology-independent  interfaces  which  are  general  enough  to  support  any  target  system  architecture.  In 
addition  to  the  portability,  programs  may  be  run  in  a  "hybrid"  mode,  that  is,  not  all  service  calls  need  be 
simulated.  For  example,  file  system  calls  in  the  application  program  can  be  intercepted  by  the  interpreter 
and  directed  to  the  existing  host  file  system.  Then,  as  a  file  system  is  developed,  the  file  system  calls  can 
be  directed  to  it.  If  the  file  system  is  not  necessary  or  is  not  the  focus  of  the  current  research,  it  need  not 
be  developed.  This  feature  of  the  prototyping  environment  allows  the  developer  to  focus  on  only  pertinent 
design  issues. 

Flexibility  enables  the  prototyping  environment  to  be  applicable  over  a  wide  range  of  configurations 
and  system  parameters.  One  of  the  keys  to  achieving  this  goal  is  to  design  interfaces  whose  operations 
are independent both of the implementation technology and the context in which they are used. For
example,  the  user-level  Send  operation  sends  an  array  of  bytes  to  an  abstract  data  type,  the  PortTag.  Thus 
this  operation  can  be  used  to  send  any  packet  type  to  any  destination,  be  it  local  or  distant. 

The third goal is that the prototyping environment be extensible enough to model additional features
of particular systems by adding modules without affecting the operation of or requiring the recompilation
of existing modules. For instance, the implementation can be extended to model the operation of
different types of I/O devices of different speeds by modifying the implementation module that performs
the read and write operations. One way to modify the implementation would be to delay for a period
depending on the address passed to the read or write operation. Reading from a disk might be indicated
by one range of addresses and take some time, while reading from a tape drive might be indicated by
another range and presumably take longer. However, because the interface of this module is
device-independent, changing the implementation to process I/O requests at different speeds will not
affect any of the modules that request I/O operations. Therefore, time and effort for system
reconfiguration can be reduced.
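
The address-dependent delay described above might look like the following Python sketch. The address
ranges, the delay values, and the use of a simulated clock argument are all hypothetical; the sketch is
not the StarLite implementation module. It only shows that callers of `read` see a device-independent
interface while the delay is decided inside the module.

```python
# Hypothetical mapping from address ranges to simulated devices and delays.
DEVICE_RANGES = [
    (0, 1_000_000, "disk", 0.02),         # low addresses: disk, short delay
    (1_000_000, 2_000_000, "tape", 2.0),  # high addresses: tape, longer delay
]


def io_delay(address):
    """Return the simulated delay, in seconds, for an I/O at `address`."""
    for low, high, _device, delay in DEVICE_RANGES:
        if low <= address < high:
            return delay
    raise ValueError(f"address {address} is not mapped to any device")


def read(address, clock):
    """Device-independent read: return the simulated completion time."""
    return clock + io_delay(address)


print(read(10, clock=0.0))         # served by the simulated disk
print(read(1_500_000, clock=0.0))  # served by the simulated tape drive
```

Changing the table changes device speeds without touching any module that calls `read`.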

The expressive power and performance evaluation capability of our prototyping environment have been
demonstrated by implementing real-time database systems and investigating the performance
characteristics of the priority-ceiling protocol and multiversion databases.

In real-time database systems, transactions must be scheduled to meet their timing constraints. In
addition, the system should support predictable behavior, so that the possibility of critical tasks missing
their deadlines can be reported ahead of time, before those deadlines expire. The priority ceiling protocol
is one approach to achieving a high degree of schedulability and system predictability. In this paper, we
have investigated this approach and compared its performance with other techniques and design choices.
It is shown that this technique might be appropriate for real-time transaction scheduling, since it is very
stable over a wide range of transaction sizes and, compared with two-phase locking protocols, it reduces
the number of deadline-missing transactions.

Using the prototyping environment, we have shown that in general, a database system with a
multiversion concurrency control algorithm performs better while processing read requests. Read
requests that would be aborted in a single-version database system due to conflicts may be successfully
processed in a multiversion system using older versions. Therefore, when read requests dominate the
transaction load, and there is a high probability of abort of read-only transactions due to conflicts, a
multiversion system outperforms its corresponding single-version system. The relative sizes of the read
and write sets of transactions are an important factor affecting the performance. Although the actual
performance figures will vary depending on workload and implementation details, we believe that our
results provide a good picture of the costs and benefits associated with the multiversion approach to
concurrency control.

Real-time distributed database systems need further investigation. In the priority ceiling protocol and
many other database scheduling algorithms, preemption is usually not allowed. To reduce the number of
deadline-missing transactions, however, preemption may need to be considered. The preemption decision
in a real-time database system must be made very carefully, and as pointed out in [Stan88], it should not
necessarily be based only on relative deadlines. This is because preemption implies not only that the
work done by the preempted transaction must be undone, but also that the transaction, if restarted later,
must redo the work. The resultant delay and the wasted execution may cause one or both of these
transactions, as well as other transactions, to miss their deadlines. Several approaches to designing
scheduling algorithms for real-time transactions have been proposed [Liu87, Stan88, Abb88], but their
performance in distributed environments has not been studied. The prototyping environment described in
this paper is an appropriate research vehicle for investigating such new techniques and scheduling
algorithms for real-time database systems.

Fig. 9. Average transaction response time